# Last.fm HetRec 2011 Dataset Preprocessing

### 1. Introduction

This notebook is part of a Master's thesis project on Music Recommender Systems at Universitat Pompeu Fabra. The dataset we will be working with is the Last.fm HetRec 2011 Dataset, which was released as part of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011) (http://ir.ii.uam.es/hetrec2011). This workshop was held during the 5th ACM Conference on Recommender Systems (RecSys 2011) (http://recsys.acm.org/2011). The dataset was generated by the Information Retrieval Group at Universidad Autónoma de Madrid and is publicly available through the GroupLens research group here: https://grouplens.org/datasets/hetrec-2011/

The HetRec 2011 dataset contains 92,834 user-artist listening records from 1,892 users, along with data on social networking and tagging activities from the Last.fm online music platform. This dataset provides a valuable foundation for exploring the complex interactions between users, artists, and the recommendation algorithms that connect them.

This notebook represents the initial phase of the project, where we will conduct an exploration of the dataset to identify the most relevant data for our research. The focus will be on understanding the dataset’s structure, cleaning the data, and selecting the features that will be most useful for addressing the research questions in the thesis.

To efficiently handle the computational requirements of this analysis, we are utilizing Google Colab, which offers powerful cloud-based resources. This setup not only enhances the reproducibility of the project but also ensures that resource-intensive tasks can be completed in a timely manner.

**References:**

Last.fm website, http://www.lastfm.com

Cantador, I., Brusilovsky, P., & Kuflik, T. (2011). 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems (RecSys 2011)

### 2. Data Import and Preparation

In this section, we import and extract the dataset files, then read each file into a pandas dataframe for easier manipulation and analysis.

In [None]:
# Import libraries
from google.colab import drive, files
import pandas as pd
import chardet
import zipfile
import os
import gc

In [None]:
# Mount Google Drive
drive.mount('/content/drive')

# Specify path to the dataset zip file in Google Drive
thesis_folder = '/content/drive/My Drive/SMC_Thesis'
zip_file_path = thesis_folder + '/hetrec2011-lastfm-2k.zip'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# List all files in the directory
path_check = os.listdir(thesis_folder)
print(path_check)

['lastfm_unzipped_files']


In [None]:
# If Google Drive does not mount correctly, upload the zip file from the local machine instead
uploaded = files.upload()

Saving hetrec2011-lastfm-2k.zip to hetrec2011-lastfm-2k.zip


In [None]:
# Get the zip file name
zip_file_path = list(uploaded.keys())[0]

In [None]:
# Create directory to unzip the files
unzip_dir = '/content/drive/My Drive/SMC_Thesis/lastfm_unzipped_files'
os.makedirs(unzip_dir, exist_ok=True)

# Unzipping the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(unzip_dir)

print(f"Files extracted to {unzip_dir}")

Files extracted to /content/drive/My Drive/SMC_Thesis/lastfm_unzipped_files


In [None]:
# List the .dat files in the extracted dataset
dat_files = [f for f in os.listdir(unzip_dir) if f.endswith('.dat')]

In [None]:
# Empty dictionary to store DataFrames with their filenames
dataframes = {}

for file in dat_files:
    file_path = os.path.join(unzip_dir, file)

    # Detect the file's encoding, as the files use different encodings
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
        encoding = result['encoding']

    try:
        # Read the file with the detected encoding
        df = pd.read_csv(file_path, sep='\t', header=0, encoding=encoding)
    except UnicodeDecodeError:
        # Change encoding if there's an issue
        df = pd.read_csv(file_path, sep='\t', header=0, encoding='ISO-8859-1')

    # Save the DataFrame in the dictionary with the filename as the key
    print(file)
    dataframes[file] = df

# Display the filenames to see if all of them got correctly read
for filename, dataframe in dataframes.items():
    print(f"File: {filename}")
    #print(dataframe.head())
    print("\n")

user_friends.dat
artists.dat
user_artists.dat
user_taggedartists.dat
user_taggedartists-timestamps.dat
tags.dat
File: user_friends.dat


File: artists.dat


File: user_artists.dat


File: user_taggedartists.dat


File: user_taggedartists-timestamps.dat


File: tags.dat
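
The encoding fallback used in the loading loop can be illustrated on a small in-memory sample. The sketch below is not part of the original notebook: the helper name `read_tsv_with_fallback`, the list of candidate encodings, and the sample bytes are assumptions chosen for illustration, and it tries encodings in order instead of calling chardet.

```python
import io

import pandas as pd


def read_tsv_with_fallback(raw, encodings=("utf-8", "ISO-8859-1")):
    """Decode raw bytes with the first encoding that works, then parse as TSV.

    Hypothetical helper for illustration; returns (DataFrame, encoding used).
    """
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
        return pd.read_csv(io.StringIO(text), sep="\t", header=0), enc
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: Latin-1 encoded bytes that are not valid UTF-8
sample = "id\tname\n1\tBjörk\n".encode("ISO-8859-1")
df_sample, used_encoding = read_tsv_with_fallback(sample)
print(used_encoding, df_sample.loc[0, "name"])
```

Because ISO-8859-1 can decode any byte sequence, placing it last makes it a catch-all, which is the same role it plays in the `except` branch of the loop.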




In [None]:
# Assign each dataframe in the dictionary to its own variable, for easier access
user_friends = dataframes['user_friends.dat']
user_taggedartists = dataframes['user_taggedartists.dat']
user_taggedartists_timestamps = dataframes['user_taggedartists-timestamps.dat']
artists = dataframes['artists.dat']
tags = dataframes['tags.dat']
user_artists = dataframes['user_artists.dat']

In [None]:
# Release the dictionary to free RAM in Google Colab
del dataframes

# Run garbage collection to free up memory
gc.collect()

24

### 3. Dataset Information and Description

The dataset contains the following files and information:

**Data Statistics:**

*   1892 users
*   17632 artists
*   12717 bi-directional user friend relations, i.e. 25434 (user_i, user_j) pairs
        avg. 13.443 friend relations per user
*   92834 user-listened artist relations, i.e. tuples [user, artist, listeningCount]
         avg. 49.067 artists most listened by each user
         avg. 5.265 users who listened each artist
*   11946 tags
*   186479 tag assignments (tas), i.e. tuples [user, tag, artist]
         avg. 98.562 tas per user
         avg. 14.891 tas per artist
         avg. 18.930 distinct tags used by each user
         avg. 8.764 distinct tags used for each artist
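
Several of the averages above follow directly from the listed counts. As a quick sanity check (a sketch only; the variable names are ours), the per-user and per-artist figures can be recomputed as simple ratios:

```python
# Counts copied from the statistics listed above
n_users = 1892
n_artists = 17632
n_friend_pairs = 25434       # (user_i, user_j) pairs
n_listens = 92834            # [user, artist, listeningCount] tuples
n_tag_assignments = 186479   # [user, tag, artist] tuples

friends_per_user = round(n_friend_pairs / n_users, 3)
artists_per_user = round(n_listens / n_users, 3)
users_per_artist = round(n_listens / n_artists, 3)
tas_per_user = round(n_tag_assignments / n_users, 3)

print(friends_per_user, artists_per_user, users_per_artist, tas_per_user)
```

The per-artist tagging average is not recomputed here, since it appears to be taken over a different artist count than the 17632 total.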



**Files:**

*   artists.dat
          This file contains information about music artists listened to and tagged by the users.
*   tags.dat
   
   	      This file contains the set of tags available in the dataset.

*   user_artists.dat
   
        This file contains the artists listened to by each user.
        It also provides a listening count for each [user, artist] pair.

*   user_taggedartists.dat - user_taggedartists-timestamps.dat
   
        These files contain the tag assignments of artists provided by each particular user.
        They also contain the timestamps when the tag assignments were done.
   
*   user_friends.dat
   
        This file contains the friend relations between users in the database.



**Data Format:**

*   artists.dat --> [id, name, url, pictureURL]
        Example:
        707	Metallica	http://www.last.fm/music/Metallica	http://userserve-ak.last.fm/serve/252/7560709.jpg
*   tags.dat --> [tagID, tagValue]
        Example:
        1	metal
*   user_artists.dat --> [userID, artistID, weight]
        Example:
        2	51	13883
*   user_taggedartists.dat --> [userID, artistID, tagID, day, month, year]
        Example:
        2	52	13	1	4	2009  
*   user_taggedartists-timestamps.dat --> [userID, artistID, tagID, timestamp]
        Example:
        2	52	13	1238536800000
*   user_friends.dat --> [userID, friendID]
        Example:
        2	275

The information and examples above can also be found in the dataset readme: https://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-readme.txt

In [None]:
# Displaying the top 3 rows of each dataframe to ensure the data was correctly imported
user_friends.head(3)

Unnamed: 0,userID,friendID
0,2,275
1,2,428
2,2,515


In [None]:
user_taggedartists.head(3)

Unnamed: 0,userID,artistID,tagID,day,month,year
0,2,52,13,1,4,2009
1,2,52,15,1,4,2009
2,2,52,18,1,4,2009


In [None]:
user_taggedartists_timestamps.head(3) # no need to keep this one

Unnamed: 0,userID,artistID,tagID,timestamp
0,2,52,13,1238536800000
1,2,52,15,1238536800000
2,2,52,18,1238536800000
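
The timestamp column carries the same information as the day, month, and year columns of `user_taggedartists`. The sketch below converts the first timestamp shown above from milliseconds to a date. The UTC+2 offset is an assumption on our part (it happens to reproduce the 1/4/2009 value in the other file); the readme excerpt quoted in this notebook does not state the time zone.

```python
from datetime import timedelta, timezone

import pandas as pd

ts_ms = 1238536800000  # first timestamp in the output above

# Interpret the value as milliseconds since the Unix epoch, in UTC
ts_utc = pd.to_datetime(ts_ms, unit="ms", utc=True)

# Assumed fixed offset of +2 hours, for illustration only
ts_local = ts_utc.tz_convert(timezone(timedelta(hours=2)))

print(ts_utc)
print(ts_local.day, ts_local.month, ts_local.year)
```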


In [None]:
artists.head(3)

Unnamed: 0,id,name,url,pictureURL
0,1,MALICE MIZER,http://www.last.fm/music/MALICE+MIZER,http://userserve-ak.last.fm/serve/252/10808.jpg
1,2,Diary of Dreams,http://www.last.fm/music/Diary+of+Dreams,http://userserve-ak.last.fm/serve/252/3052066.jpg
2,3,Carpathian Forest,http://www.last.fm/music/Carpathian+Forest,http://userserve-ak.last.fm/serve/252/40222717...


In [None]:
tags.head(3)

Unnamed: 0,tagID,tagValue
0,1,metal
1,2,alternative metal
2,3,goth rock


In [None]:
user_artists.head(3)

Unnamed: 0,userID,artistID,weight
0,2,51,13883
1,2,52,11690
2,2,53,11351


### 4. Merging Dataframes

In this section, we will combine the individual dataframes containing user, artist, and tagging information. This merged dataframe will contain all relevant data, letting us analyze the relationships between users, artists, and tags more effectively.

In [None]:
# Merge 'user_taggedartists' and 'user_friends' dataframes
merged_df = pd.merge(user_taggedartists, user_friends, on='userID', how='left')
merged_df.head()

Unnamed: 0,userID,artistID,tagID,day,month,year,friendID
0,2,52,13,1,4,2009,275
1,2,52,13,1,4,2009,428
2,2,52,13,1,4,2009,515
3,2,52,13,1,4,2009,761
4,2,52,13,1,4,2009,831


In [None]:
# Check the length of 'user_taggedartists' to keep track of the size of the merged dataframe
len(user_taggedartists)

186479

In [None]:
# Check the length of 'user_friends'
len(user_friends)

25434

In [None]:
# Check the length of 'merged_df'
len(merged_df)

2857144
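
The merged dataframe is much larger than either input because a left merge on `userID` repeats each tag assignment once per friend of that user. The toy example below (made-up data, hypothetical variable names) shows how the row count of such a merge can be predicted from the number of friends per user.

```python
import pandas as pd

# Toy stand-ins for user_taggedartists and user_friends
tagged = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "artistID": [10, 11, 10, 12],
    "tagID": [5, 6, 5, 7],
})
friends = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "friendID": [2, 3, 4, 1, 3],
})

merged = pd.merge(tagged, friends, on="userID", how="left")

# Each tag row is repeated once per friend; users without friends keep one row
friends_per_user = friends["userID"].value_counts()
rows_per_tag = tagged["userID"].map(friends_per_user).fillna(1)
expected_rows = int(rows_per_tag.sum())

print(len(merged), expected_rows)
```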

In [None]:
# Group the friendID column so there is not a row for each friend,
#   reducing the dataframe dimensions and making the final dataframe more manageable
grouped_df = merged_df.groupby(['userID', 'artistID', 'tagID', 'day', 'month', 'year'])['friendID'].apply(list).reset_index()
grouped_df.head()

Unnamed: 0,userID,artistID,tagID,day,month,year,friendID
0,2,52,13,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
1,2,52,15,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
2,2,52,18,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
3,2,52,21,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
4,2,52,41,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."


In [None]:
# Check the length of 'grouped_df' to control the size of the dataframe
len(grouped_df)

186479
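
An alternative that avoids building the large intermediate dataframe is to collapse the friends into one list per user first and merge afterwards. This is a different order of operations from the one used in this notebook, shown here on toy data as a sketch; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for user_taggedartists and user_friends
tagged = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "artistID": [10, 11, 10, 12],
    "tagID": [5, 6, 5, 7],
})
friends = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "friendID": [2, 3, 4, 1, 3],
})

# One row per user, with all friend IDs gathered into a list
friend_lists = friends.groupby("userID")["friendID"].apply(list).reset_index()

# The merge is now many-to-one, so the row count of 'tagged' is preserved
result = pd.merge(tagged, friend_lists, on="userID", how="left")
print(len(tagged), len(result))
```

One difference to keep in mind: with this order, a user without friends ends up with NaN in `friendID`, whereas merging first and grouping afterwards yields a list containing a single NaN.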

In [None]:
# Free up space
del merged_df
del user_taggedartists
del user_friends

# Run garbage collection to free up memory
gc.collect()

0

In [None]:
# Merge previously merged dataframe with the tags
merged_df = pd.merge(grouped_df, tags, on='tagID', how='left')

# Rearrange columns so the 'tagValue' is next to 'tagID'
cols = merged_df.columns.tolist()
cols.insert(3, cols.pop(cols.index('tagValue')))
user_df = merged_df[cols]
user_df.head()

Unnamed: 0,userID,artistID,tagID,tagValue,day,month,year,friendID
0,2,52,13,chillout,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
1,2,52,15,downtempo,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
2,2,52,18,electronic,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
3,2,52,21,trip-hop,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
4,2,52,41,female vovalists,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."


In [None]:
# Release the dataframes that are no longer of use
del merged_df
del grouped_df
del tags

# Run garbage collection to free up memory
gc.collect()

0

In [None]:
# Merge the dataframe that now includes the tag values with 'user_artists'
merged_df = pd.merge(user_df, user_artists, on=['userID','artistID'], how='left')

# Rearrange columns so the 'weight' is next to 'artistID'
cols = merged_df.columns.tolist()
cols.insert(2, cols.pop(cols.index('weight')))
user_df = merged_df[cols]
user_df.head()

Unnamed: 0,userID,artistID,weight,tagID,tagValue,day,month,year,friendID
0,2,52,11690.0,13,chillout,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
1,2,52,11690.0,15,downtempo,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
2,2,52,11690.0,18,electronic,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
3,2,52,11690.0,21,trip-hop,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
4,2,52,11690.0,41,female vovalists,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."


In [None]:
# As the 'artists' dataframe contains more information than we will use, create a smaller dataframe with necessary info
artists_merge =  artists[['id','name']]
artists_merge = artists_merge.rename(columns={'id': 'artistID'})
artists_merge.head()

Unnamed: 0,artistID,name
0,1,MALICE MIZER
1,2,Diary of Dreams
2,3,Carpathian Forest
3,4,Moi dix Mois
4,5,Bella Morte


In [None]:
# Merge previously merged dataframe with 'artists_merge'
merged_df = pd.merge(user_df, artists_merge, on='artistID', how='left')

# Rearrange columns so the artist name is next to artistID
cols = merged_df.columns.tolist()
cols.insert(2, cols.pop(cols.index('name')))
user_df = merged_df[cols]
user_df.head()

Unnamed: 0,userID,artistID,name,weight,tagID,tagValue,day,month,year,friendID
0,2,52,Morcheeba,11690.0,13,chillout,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
1,2,52,Morcheeba,11690.0,15,downtempo,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
2,2,52,Morcheeba,11690.0,18,electronic,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
3,2,52,Morcheeba,11690.0,21,trip-hop,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
4,2,52,Morcheeba,11690.0,41,female vovalists,1,4,2009,"[275, 428, 515, 761, 831, 909, 1209, 1210, 123..."
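
The column reordering step is repeated after each merge in this section. It could be factored into a small helper; `move_column` below is a hypothetical name and the demo dataframe is made up, so treat this as a sketch and not as part of the original workflow.

```python
import pandas as pd


def move_column(df, column, position):
    """Return df with 'column' moved to the given positional index."""
    cols = df.columns.tolist()
    cols.insert(position, cols.pop(cols.index(column)))
    return df[cols]


# Toy frame where 'name' was appended last by a merge
demo = pd.DataFrame({
    "userID": [2],
    "artistID": [52],
    "weight": [11690.0],
    "name": ["Morcheeba"],
})
demo = move_column(demo, "name", 2)
print(demo.columns.tolist())
```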


### 5. Cleaning and Saving the Final Dataset

In this final section, we will inspect the final merged dataset for any null or NaN values and perform necessary cleaning to ensure data integrity. We will also format the dataset appropriately, preparing it for use in the next phase of the research. Once the cleaning and formatting are complete, the final dataset will be saved.

In [None]:
# Check for null values in the different columns in the final dataset
null_summary = user_df.isnull().sum()
print(null_summary)

userID           0
artistID         0
name          1538
weight      113121
tagID            0
tagValue         0
day              0
month            0
year             0
friendID         0
dtype: int64
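
The nulls in `name` and `weight` come from left merges in which some rows found no match in the right-hand dataframe. One way to inspect this directly is the `indicator=True` option of `pd.merge`, which adds a `_merge` column telling where each row came from. The sketch below uses made-up toy data.

```python
import pandas as pd

# Toy stand-ins: three tag assignments, but only one listening record
tag_rows = pd.DataFrame({
    "userID": [2, 2, 8],
    "artistID": [52, 995, 14103],
    "tagID": [13, 16, 24],
})
listens = pd.DataFrame({"userID": [2], "artistID": [52], "weight": [11690]})

checked = pd.merge(
    tag_rows, listens, on=["userID", "artistID"], how="left", indicator=True
)

# 'left_only' marks tag assignments with no matching listening record
unmatched = checked[checked["_merge"] == "left_only"]
print(checked["_merge"].value_counts())
print(len(unmatched), checked["weight"].isnull().sum())
```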


In [None]:
# Filter the DataFrame to show rows where 'name' is null
null_name_df = user_df[user_df['name'].isnull()]
print(null_name_df)

        userID  artistID name  weight  tagID              tagValue  day  \
239          8     14103  NaN     NaN     24                   pop    1   
240          8     14103  NaN     NaN     30                german    1   
241          8     14103  NaN     NaN    130      female vocalists    1   
369          9     13785  NaN     NaN    141            visual kei    1   
1191        12     15189  NaN     NaN     81                 indie    1   
...        ...       ...  ...     ...    ...                   ...  ...   
186266    2096     18699  NaN     NaN    481                   usa    1   
186383    2099     11371  NaN     NaN     13              chillout    1   
186384    2099     11371  NaN     NaN     15             downtempo    1   
186385    2099     11371  NaN     NaN     21              trip-hop    1   
186386    2099     11371  NaN     NaN    758  instrumental hip-hop    1   

        month  year                                           friendID  
239        11  2006  [339,

In [None]:
# Replace NaN values in the 'name' column with an empty string
user_df['name'] = user_df['name'].fillna('')

# Verify that there are no more NaN values in the 'name' column
print(user_df['name'].isnull().sum())  # Should print 0 if all NaNs were replaced

0


In [None]:
# Check where and how the null values are displayed in the dataset
null_rows_any = user_df[user_df.isnull().any(axis=1)]
print(null_rows_any)

        userID  artistID                name  weight  tagID          tagValue  \
27           2       995        China Crisis     NaN     16          new wave   
28           2       995        China Crisis     NaN     17         synth pop   
29           2       995        China Crisis     NaN     24               pop   
30           2       995        China Crisis     NaN     25               80s   
31           2       995        China Crisis     NaN     42         synth-pop   
...        ...       ...                 ...     ...    ...               ...   
186410    2099     16468     Clutchy Hopkins     NaN    191      instrumental   
186411    2099     16745             DJ Food     NaN     13          chillout   
186412    2099     16745             DJ Food     NaN     15         downtempo   
186413    2099     16745             DJ Food     NaN     21          trip-hop   
186465    2100      3855  Andrius Mamontovas     NaN   3271  melancholic rock   

        day  month  year   

In [None]:
# Replace NaN values in the 'weight' column with 0, as a NaN here means that
#   no listening count is recorded for that [user, artist] pair in 'user_artists'
user_df['weight'] = user_df['weight'].fillna(0)

# Verify that there are no more NaN values in the 'weight' column
print(user_df['weight'].isnull().sum())  # Should print 0 if all NaNs were replaced

0


In [None]:
# Check for null values in each column
null_summary = user_df.isnull().sum()
print(null_summary)

userID      0
artistID    0
name        0
weight      0
tagID       0
tagValue    0
day         0
month       0
year        0
friendID    0
dtype: int64
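
The two fill steps can also be written as a single `fillna` call with one fill value per column. In addition, `weight` is displayed as a float (for example 11690.0) because the NaN values introduced by the merge forced a float dtype; once they are filled, the column can be cast back to integers. The sketch below uses a made-up two-row frame.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "name": ["Morcheeba", np.nan],
    "weight": [11690.0, np.nan],
})

# One fill value per column, passed as a dictionary
demo = demo.fillna({"name": "", "weight": 0})

# No NaN values remain, so the float column can be cast to integer
demo["weight"] = demo["weight"].astype("int64")

print(demo.dtypes)
print(demo)
```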


In [None]:
# Save final dataframe as a csv
file_path = '/content/drive/My Drive/SMC_Thesis/final_dataframe.csv'
user_df.to_csv(file_path, index=False)
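
One point to keep in mind when reloading this file: CSV has no list type, so the lists in `friendID` are written as text such as `"[275, 428]"` and come back as strings. The sketch below shows a round trip on a made-up frame, using an in-memory buffer in place of the Drive path, and parses the strings back with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

demo = pd.DataFrame({
    "userID": [2, 3],
    "friendID": [[275, 428], [78, 255]],
})

# Write to an in-memory buffer instead of a file on Drive
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
raw_value_type = type(loaded.loc[0, "friendID"])  # a string, not a list

# Parse the string representation back into Python lists
loaded["friendID"] = loaded["friendID"].apply(ast.literal_eval)
print(raw_value_type, loaded.loc[0, "friendID"])
```

This parsing assumes every list holds plain integers; a list containing NaN would be written as `[nan]`, which `ast.literal_eval` cannot parse.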