# Exploratory Data Analysis (EDA)

## Overview
The datasets come from Last.fm and consist of six files:
- `artists.dat`
- `tags.dat`
- `user_artists.dat` 
- `user_taggedartists.dat`
- `user_taggedartists-timestamps.dat`
- `user_friends.dat`

## Data Descriptions

1. **`artists.dat`**: contains information about the music artists that users listened to and tagged.
   - columns: `id`, `name`, `url`, `pictureURL`

2. **`tags.dat`**: contains the set of tags available in the dataset.
   - columns: `tagID`, `tagValue`

3. **`user_artists.dat`**: contains the artists listened to by each user, with a listening count for each [user, artist] pair.
   - columns: `userID`, `artistID`, `weight`

4. **`user_taggedartists.dat`**: contains the tag assignments of artists made by each user, with the date of each assignment.
   - columns: `userID`, `artistID`, `tagID`, `day`, `month`, `year`

5. **`user_taggedartists-timestamps.dat`**: contains the tag assignments of artists made by each user, with the timestamp of each assignment.
   - columns: `userID`, `artistID`, `tagID`, `timestamp`

6. **`user_friends.dat`**: contains the friend relations between users in the database.
   - columns: `userID`, `friendID`
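
The column lists above can be turned into a quick sanity check after loading. The sketch below is one possible way to do this; `EXPECTED_COLUMNS` and `check_columns` are hypothetical names introduced here for illustration, and the toy DataFrame stands in for a file loaded with `pd.read_csv`.

```python
import pandas as pd

# Expected columns per file, copied from the descriptions above.
EXPECTED_COLUMNS = {
    "artists.dat": ["id", "name", "url", "pictureURL"],
    "tags.dat": ["tagID", "tagValue"],
    "user_artists.dat": ["userID", "artistID", "weight"],
    "user_taggedartists.dat": ["userID", "artistID", "tagID", "day", "month", "year"],
    "user_taggedartists-timestamps.dat": ["userID", "artistID", "tagID", "timestamp"],
    "user_friends.dat": ["userID", "friendID"],
}


def check_columns(filename, df):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS[filename] if col not in df.columns]


# Toy frame standing in for a loaded file; the values are illustrative only.
toy_friends = pd.DataFrame({"userID": [2, 2], "friendID": [275, 428]})
missing = check_columns("user_friends.dat", toy_friends)
print(missing)  # an empty list means every expected column is present
```

An empty list for each file indicates that the loaded columns match the descriptions.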

## Loading Data & Importing Libraries

In [5]:
# Load libraries
import pandas as pd
import os

In [13]:
# import datasets
artists = pd.read_csv(os.path.join('..','data','artists.dat'), delimiter='\t')
tags = pd.read_csv(os.path.join('..','data','tags.dat'), delimiter='\t',encoding='ISO-8859-1')
user_artists = pd.read_csv(os.path.join('..','data','user_artists.dat'), delimiter='\t')
user_friends = pd.read_csv(os.path.join('..','data','user_friends.dat'), delimiter='\t')
user_taggedartists_timestamps = pd.read_csv(os.path.join('..','data','user_taggedartists-timestamps.dat'), delimiter='\t')
user_taggedartists = pd.read_csv(os.path.join('..','data','user_taggedartists.dat'), delimiter='\t')

## Initial Analysis
We will check for missing values and look at the size and structure of each dataset.

### Artists Dataset

In [17]:
print("Artists dataset:")
print(artists.info())
print(artists.head())

Artists dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17632 entries, 0 to 17631
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          17632 non-null  int64 
 1   name        17632 non-null  object
 2   url         17632 non-null  object
 3   pictureURL  17188 non-null  object
dtypes: int64(1), object(3)
memory usage: 551.1+ KB
None
   id               name                                         url  \
0   1       MALICE MIZER       http://www.last.fm/music/MALICE+MIZER   
1   2    Diary of Dreams    http://www.last.fm/music/Diary+of+Dreams   
2   3  Carpathian Forest  http://www.last.fm/music/Carpathian+Forest   
3   4       Moi dix Mois       http://www.last.fm/music/Moi+dix+Mois   
4   5        Bella Morte        http://www.last.fm/music/Bella+Morte   

                                          pictureURL  
0    http://userserve-ak.last.fm/serve/252/10808.jpg  
1  http://userserve-ak.last.fm/serv

The `id` and `name` columns are needed to map each artist ID in our recommendations back to an artist name. The `url` and `pictureURL` columns are not relevant to our collaborative filtering model, so we will drop them. The `pictureURL` column has 444 missing values (17,188 non-null out of 17,632), but since we are dropping the column they will not affect our model. After cleaning, this dataset will have 2 columns and 17,632 rows.
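
The missing-value count can also be read directly with `isna().sum()` rather than from the `info()` output. The following is a minimal sketch on a toy DataFrame (the rows and `example.com` URLs are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the Artists data; the values are illustrative, not real rows.
toy_artists = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Artist A", "Artist B", "Artist C"],
    "url": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "pictureURL": ["http://example.com/a.jpg", None, None],
})

# Count missing values per column before dropping anything.
missing_per_column = toy_artists.isna().sum()
print(missing_per_column)

# Keep only the columns needed to map an artist ID to a name.
toy_artists_small = toy_artists[["id", "name"]]
print(toy_artists_small.isna().sum().sum())  # total missing values that remain
```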

### Tags Dataset

In [18]:
print("Tags dataset:")
print(tags.info())
print(tags.head())

Tags dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11946 entries, 0 to 11945
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tagID     11946 non-null  int64 
 1   tagValue  11946 non-null  object
dtypes: int64(1), object(1)
memory usage: 186.8+ KB
None
   tagID           tagValue
0      1              metal
1      2  alternative metal
2      3          goth rock
3      4        black metal
4      5        death metal


Both the `tagID` and `tagValue` columns will be needed for our model. The `tagID` column identifies each tag, and the `tagValue` column gives its text, for example 'metal' or 'rock', which could indicate that a user likes these genres of music. There are no missing values in this dataset. There are 2 columns and 11,946 rows.
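
Since tags are free text, two checks may be worth running: that `tagID` values are unique, and that no two tags differ only in case or surrounding whitespace. The sketch below uses a toy DataFrame with a deliberately planted near-duplicate; whether the real `tags.dat` contains such cases is not known from the output above.

```python
import pandas as pd

# Toy tags with one planted near-duplicate ("metal" vs "Metal ").
toy_tags = pd.DataFrame({
    "tagID": [1, 2, 3, 4],
    "tagValue": ["metal", "Metal ", "goth rock", "black metal"],
})

# True if no tagID appears more than once.
ids_unique = toy_tags["tagID"].is_unique

# Normalise the text, then flag every row whose normalised value repeats.
normalised = toy_tags["tagValue"].str.strip().str.lower()
near_duplicates = toy_tags[normalised.duplicated(keep=False)]

print(ids_unique)
print(near_duplicates)
```

Note that `drop_duplicates()` on the raw DataFrame would not catch these rows, because their `tagID` and raw `tagValue` differ.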

### User-Artists Dataset

In [19]:
print("User-Artists dataset:")
print(user_artists.info())
print(user_artists.head())

User-Artists dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92834 entries, 0 to 92833
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   userID    92834 non-null  int64
 1   artistID  92834 non-null  int64
 2   weight    92834 non-null  int64
dtypes: int64(3)
memory usage: 2.1 MB
None
   userID  artistID  weight
0       2        51   13883
1       2        52   11690
2       2        53   11351
3       2        54   10300
4       2        55    8983


The `userID` and `artistID` columns identify the users and the artists that they interact with. The `weight` column represents the listening count, indicating how many times a user has listened to a specific artist. The `weight` column will be important in building our recommender system, since the size of the listening count can be used as a signal of how much a user prefers a given artist. There are no missing values in this dataset. There are 3 columns and 92,834 rows.
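
To illustrate how listening counts could feed a collaborative filtering model, the sketch below builds a small user-by-artist matrix. The `log1p` transform is an assumption made here, not a step taken in this notebook: raw counts can span several orders of magnitude (the sample rows above already reach 13,883), and a log scale is one common way to dampen that. The toy rows are illustrative.

```python
import numpy as np
import pandas as pd

# Toy interactions; the first two rows mirror the sample output above.
toy_user_artists = pd.DataFrame({
    "userID": [2, 2, 3, 3],
    "artistID": [51, 52, 51, 53],
    "weight": [13883, 11690, 5, 120],
})

# Assumed preprocessing: log(1 + count) compresses very large listening counts.
toy_user_artists["log_weight"] = np.log1p(toy_user_artists["weight"])

# Rows are users, columns are artists; unseen pairs are filled with 0.
matrix = toy_user_artists.pivot_table(
    index="userID", columns="artistID", values="log_weight", fill_value=0.0
)
print(matrix)
```

A dense pivot like this is only practical for small samples; with 92,834 interactions a sparse representation would probably be the better fit.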

### User-Tagged Artists Dataset

In [20]:
print("User-Tagged Artists dataset:")
print(user_taggedartists.info())
print(user_taggedartists.head())

User-Tagged Artists dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186479 entries, 0 to 186478
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   userID    186479 non-null  int64
 1   artistID  186479 non-null  int64
 2   tagID     186479 non-null  int64
 3   day       186479 non-null  int64
 4   month     186479 non-null  int64
 5   year      186479 non-null  int64
dtypes: int64(6)
memory usage: 8.5 MB
None
   userID  artistID  tagID  day  month  year
0       2        52     13    1      4  2009
1       2        52     15    1      4  2009
2       2        52     18    1      4  2009
3       2        52     21    1      4  2009
4       2        52     41    1      4  2009


The `userID` and `artistID` columns connect users to the artists they have tagged, and the `tagID` column identifies the tag applied to the artist. The `day`, `month`, and `year` columns give the date on which a user tagged an artist. There are no missing values. There are 6 columns and 186,479 rows. The date columns are not essential for our collaborative filtering model, but we will keep them for now in case we implement a time-based element in the model.
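
If the date columns are kept, they can be combined into a single datetime column. A minimal sketch, using toy rows that mirror the sample output above: `pd.to_datetime` accepts a DataFrame whose columns are named `year`, `month`, and `day`.

```python
import pandas as pd

# Toy rows mirroring the first two rows of the output above.
toy_tagged = pd.DataFrame({
    "userID": [2, 2],
    "artistID": [52, 52],
    "tagID": [13, 15],
    "day": [1, 1],
    "month": [4, 4],
    "year": [2009, 2009],
})

# Assemble one datetime column from the three date components.
toy_tagged["date"] = pd.to_datetime(toy_tagged[["year", "month", "day"]])
print(toy_tagged[["userID", "artistID", "tagID", "date"]])
```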

### User-Tagged Artists Timestamps Dataset

In [23]:
print("User-Tagged Artists Timestamps dataset:")
print(user_taggedartists_timestamps.info())
print(user_taggedartists_timestamps.head())

User-Tagged Artists Timestamps dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186479 entries, 0 to 186478
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   userID     186479 non-null  int64
 1   artistID   186479 non-null  int64
 2   tagID      186479 non-null  int64
 3   timestamp  186479 non-null  int64
dtypes: int64(4)
memory usage: 5.7 MB
None
   userID  artistID  tagID      timestamp
0       2        52     13  1238536800000
1       2        52     15  1238536800000
2       2        52     18  1238536800000
3       2        52     21  1238536800000
4       2        52     41  1238536800000


This is essentially the `user_taggedartists.dat` dataset with the `day`, `month`, and `year` columns replaced by a single `timestamp` column, stored as milliseconds since the Unix epoch. It has no missing values, the same number of rows (186,479), and fewer columns (4 instead of 6) than the previous dataset. Since it is more compact (5.7 MB against 8.5 MB in memory), this dataset is likely more useful to us than the previous one, and we will prefer it if we implement time-based elements in our model.
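
One detail is worth checking before relying on the timestamps alone. `pd.to_datetime(..., unit='ms')` interprets the integer as milliseconds since the Unix epoch in UTC. The sample value `1238536800000` then converts to 31 March 2009, 22:00, whereas the matching row in `user_taggedartists.dat` is dated 1 April 2009. The two agree under a UTC+2 offset; that offset is an assumption inferred from this one sample, not something documented in this notebook.

```python
import pandas as pd

# Timestamp from the first sample row above, in milliseconds since the epoch.
raw_ms = 1238536800000

# Interpreted as UTC, this falls on the evening of 31 March 2009.
as_utc = pd.to_datetime(raw_ms, unit="ms")
print(as_utc)  # 2009-03-31 22:00:00

# Assumed offset (UTC+2) that would reproduce the day/month/year columns.
assumed_offset = pd.Timedelta(hours=2)
as_local = as_utc + assumed_offset
print(as_local)  # 2009-04-01 00:00:00
```

If calendar dates matter for a time-based model, it may be safer to compare the converted timestamps against the `day`, `month`, and `year` columns on the full dataset before discarding the latter.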

### User-Friends Dataset

In [22]:
print("User-Friends dataset:")
print(user_friends.info())
print(user_friends.head())

User-Friends dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25434 entries, 0 to 25433
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   userID    25434 non-null  int64
 1   friendID  25434 non-null  int64
dtypes: int64(2)
memory usage: 397.5 KB
None
   userID  friendID
0       2       275
1       2       428
2       2       515
3       2       761
4       2       831


The `userID` column identifies the user, and the `friendID` column links them to a friend on the platform. This will be useful for analysing social interactions in our model and for making recommendations based on the preferences of a user's friends. There are 2 columns and 25,434 rows. There are no missing values.
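
Before using friendships for recommendations, it may help to know whether the relation is stored in both directions, that is, whether every (user, friend) row has a matching (friend, user) row. The sketch below checks this on toy data with one planted one-way pair; it makes no claim about what the real `user_friends.dat` contains.

```python
import pandas as pd

# Toy friendships: (2, 275) and (2, 428) are mutual, (3, 9) is one-way.
toy_friends = pd.DataFrame({
    "userID": [2, 275, 2, 428, 3],
    "friendID": [275, 2, 428, 2, 9],
})

# Collect all directed pairs as plain Python ints.
pairs = set(zip(toy_friends["userID"].tolist(), toy_friends["friendID"].tolist()))

# A pair is one-way if its reverse never appears.
one_way = sorted((u, f) for (u, f) in pairs if (f, u) not in pairs)
print(one_way)  # [(3, 9)]
```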

### Conclusion of Initial Analysis
We have looked at the basic properties of each dataset. There are many useful features of the data, including listening counts, tagged artists and user friendships. We need to clean the data by removing irrelevant columns.

We will summarise how we will use each dataset in our collaborative filtering model:

- **Artists dataset**: This dataset contains information about artists. It will be used to link artist names to the corresponding `artistID` in the user-related datasets. The only missing values are in `pictureURL`; we will remove that column together with `url`, as neither is relevant for our model.

- **Tags dataset**: This dataset contains the tags associated with artists. It will be used to link each `tagID` to its tag text. There are no missing values, but we need to check for irrelevant tags and remove duplicates if present.

- **User-Artists dataset**: This dataset shows which artists each user listens to, with the `weight` column representing listening counts. There are no missing values, but we need to check for and remove duplicates if present.

- **User-Tagged Artists dataset**: This dataset will not be used. We will use the **User-Tagged Artists Timestamps dataset** instead, as it contains the same tag assignments in a more compact structure with fewer columns.

- **User-Tagged Artists Timestamps dataset**: This dataset gives timestamped tag assignments of artists by users. It can be used for time-sensitive recommendations, although we may not use timestamps in our model. We will keep the `timestamp` column through cleaning in case we decide to use time-sensitive recommendations later, and drop it only when implementing a model without them. There are no missing values, but we need to check for and remove duplicates if present.

- **User-Friends dataset**: This dataset shows the friendships between users and will be used to model user similarity or influence. There are no missing values, but we need to check for and remove duplicates if present.

In summary, each dataset is mostly clean but will require checking for duplicates and potential removal of unnecessary columns. The main goal is to keep only the relevant data for the recommendation system.
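
When checking for duplicates, it can be useful to separate exact duplicate rows from rows that repeat the same key with different values. The sketch below shows the difference on toy data; treating (`userID`, `artistID`) as the key is an assumption based on the file description, which gives one listening count per [user, artist] pair.

```python
import pandas as pd

# Toy data: rows 0 and 2 share the same (userID, artistID) but differ in weight.
toy_user_artists = pd.DataFrame({
    "userID": [2, 2, 2, 3],
    "artistID": [51, 52, 51, 51],
    "weight": [100, 50, 120, 7],
})

# Rows identical in every column.
exact_duplicates = int(toy_user_artists.duplicated().sum())

# Rows that repeat an earlier (userID, artistID) pair, whatever the weight.
key_duplicates = int(
    toy_user_artists.duplicated(subset=["userID", "artistID"]).sum()
)

print(exact_duplicates, key_duplicates)  # 0 1
```

A plain `drop_duplicates()` only removes the first kind, so a key-based count is a cheap extra check.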


## Data Cleaning

In [26]:
# Drop irrelevant columns from the Artists dataset, then drop duplicate rows
artists_cleaned = artists.drop(columns=['url', 'pictureURL']).drop_duplicates(keep='first')

# Drop duplicate rows from the Tags dataset (both columns are kept)
tags_cleaned = tags.drop_duplicates(keep='first')

# For the User-Artists dataset, we can filter out rows with a weight of 0, as they show no meaningful interaction
user_artists_cleaned = user_artists[user_artists['weight'] > 0]
user_artists_cleaned = user_artists_cleaned.drop_duplicates(keep='first') 

# Drop duplicates from the User-Tagged Artists Timestamps dataset
# (.copy() so the column assignment below acts on an independent DataFrame)
user_taggedartists_timestamps_cleaned = user_taggedartists_timestamps.drop_duplicates(keep='first').copy()

# Convert timestamps from milliseconds since the Unix epoch to datetime format
user_taggedartists_timestamps_cleaned['timestamp'] = pd.to_datetime(user_taggedartists_timestamps_cleaned['timestamp'], unit='ms')

# Drop duplicates from the User-Friends dataset
user_friends_cleaned = user_friends.drop_duplicates(keep='first') 

# Output cleaned datasets for inspection
print("Cleaned Artists dataset:", artists_cleaned.info(), artists_cleaned.head())
print("Cleaned Tags dataset:", tags_cleaned.info(), tags_cleaned.head())
print("Cleaned User-Artists dataset:", user_artists_cleaned.info(), user_artists_cleaned.head())
print("Cleaned User-Tagged Artists Timestamps dataset:", user_taggedartists_timestamps_cleaned.info(), user_taggedartists_timestamps_cleaned.head())
print("Cleaned User-Friends dataset:", user_friends_cleaned.info(), user_friends_cleaned.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17632 entries, 0 to 17631
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      17632 non-null  int64 
 1   name    17632 non-null  object
dtypes: int64(1), object(1)
memory usage: 275.6+ KB
Cleaned Artists dataset: None    id               name
0   1       MALICE MIZER
1   2    Diary of Dreams
2   3  Carpathian Forest
3   4       Moi dix Mois
4   5        Bella Morte
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11946 entries, 0 to 11945
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   tagID     11946 non-null  int64 
 1   tagValue  11946 non-null  object
dtypes: int64(1), object(1)
memory usage: 186.8+ KB
Cleaned Tags dataset: None    tagID           tagValue
0      1              metal
1      2  alternative metal
2      3          goth rock
3      4        black metal
4      5        death metal
<class 'p

We can see that there were no duplicates, but it is still good practice to run the check. The data is now prepared for analysis.
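
A note on the printed output above: `DataFrame.info()` writes its summary directly to standard output and returns `None`, which is why each summary appears before its label and the label is followed by `None`. One possible way to get cleaner output is to call `info()` on its own line; `summarise` below is a hypothetical helper, shown on a toy DataFrame.

```python
import pandas as pd


def summarise(name, df):
    """Print a short summary of df and return its shape (hypothetical helper)."""
    print(f"{name}: {df.shape[0]} rows, {df.shape[1]} columns")
    df.info()          # prints directly; its return value is None
    print(df.head())
    return df.shape


# Toy frame standing in for one of the cleaned datasets.
toy = pd.DataFrame({"id": [1, 2], "name": ["Artist A", "Artist B"]})
shape = summarise("Toy dataset", toy)
```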

## Data Analysis