# Last.FM Artist Recommender System

### Data Engineering Capstone Project

#### Project Summary

![](https://www.last.fm/static/images/lastfm_logo_facebook.1b63d4451dcc.png)

Last.FM is one of the major music streaming services.
To provide personalized recommendations, Last.FM stores information about the artists each customer has listened to so far.

We will create a data pipeline that uses each customer's playlist information to provide a personalized recommendation service.

In [1]:
import pandas as pd

In [2]:
artist_df = pd.read_csv("./data/lastfm_artist.csv")
user_df = pd.read_csv("./data/lastfm_user.csv")
play_df = pd.read_csv("./data/lastfm_play.csv")

In [3]:
merge_df = (
    pd.merge(
        pd.merge(play_df, artist_df, on='artist_id'),
        user_df, on='user_id')
)

In [None]:
(
    merge_df.to_json("./playlists.json",
                     orient='records', 
                     lines=True)
)

### Download Dataset

In [None]:
!wget https://craftsangjae.s3.ap-northeast-2.amazonaws.com/data/lastfm_artist.csv -P data/
!wget https://craftsangjae.s3.ap-northeast-2.amazonaws.com/data/lastfm_user.csv -P data/
!wget https://craftsangjae.s3.ap-northeast-2.amazonaws.com/data/lastfm_play.csv -P data/

### Load Modules

In [None]:
%matplotlib inline
import os
import pandas as pd
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

### Create Spark Session

In [None]:
# AWS credentials are read from the environment instead of being hard-coded in the notebook
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]

def create_spark_session():
    spark = (
        SparkSession.builder
        .config('spark.master', 'local')
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
        .getOrCreate())
    hadoop_conf = spark._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.access.key", ACCESS_KEY)
    hadoop_conf.set("fs.s3a.secret.key", SECRET_KEY)
    hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
    hadoop_conf.set("fs.s3a.endpoint", "s3.ap-northeast-2.amazonaws.com")
    return spark

# Create the session used by the cells below
spark = create_spark_session()

## Objective 

> Non-Personalized & Personalized Artist Recommendation System

There are three datasets: user demographics, artists, and playlists. Using these datasets, we will build a data pipeline for two recommendation engines.

First, as a non-personalized recommendation, we will collect the top 10 artists for each user demographic group.

Second, as a personalized recommendation, we will use the ALS algorithm in Spark ML to select 10 recommended artists for each user.
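
The non-personalized recommendation described above amounts to a grouped top-N count. The following is a minimal sketch in plain Python on toy records, so it runs without a Spark session. The field names `gender` and `artist`, the helper `top_n_by_group`, and the toy data are illustrative assumptions, not the notebook's actual schema or pipeline code.

```python
from collections import Counter, defaultdict

def top_n_by_group(records, group_key, item_key, n=10):
    """Return {group: [item, ...]} with the n most frequent items per group."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[group_key]][rec[item_key]] += 1
    return {
        group: [item for item, _ in counter.most_common(n)]
        for group, counter in counts.items()
    }

# Toy listening records (hypothetical); real rows would come from the
# joined play, user, and artist tables.
toy_records = [
    {"gender": "f", "artist": "radiohead"},
    {"gender": "f", "artist": "radiohead"},
    {"gender": "f", "artist": "muse"},
    {"gender": "m", "artist": "muse"},
    {"gender": "m", "artist": "muse"},
    {"gender": "m", "artist": "coldplay"},
]

top_artists = top_n_by_group(toy_records, "gender", "artist", n=1)
```

In a Spark pipeline the same idea would typically be expressed as a group-by count followed by a per-group ranking; the sketch only shows the logic.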

## (Step 1) Data Description

The data was downloaded from [Music Recommendation Datasets for Research](http://ocelma.net/MusicRecommendationDataset/). It is a dataset published for research and was extracted from the actual service environment.

* `lastfm_play.csv` : how many times each user listened to each artist's music
* `lastfm_user.csv` : each user's personal information
* `lastfm_artist.csv` : each artist's name

In [None]:
play_df = spark.read.csv(
    "data/lastfm_play.csv",
    inferSchema=True,
    header=True)

user_df = spark.read.csv(
    "data/lastfm_user.csv",
    inferSchema=True,
    header=True)

artist_df = spark.read.csv(
    "data/lastfm_artist.csv",
    inferSchema=True,
    header=True)

`play_df`

In [None]:
play_df.show(10)

`user_df`


In [None]:
user_df.show(10)

`artist_df`

In [None]:
artist_df.show(10)

## (Step 2) Exploratory data analysis 

### - `play_df` 

#### 1. Number of distinct artists played per user

On average, users play music from about 50 artists.

In [None]:
(
    play_df
    .dropDuplicates(['user_id', 'artist_id'])
    .groupby('user_id')
    .count()
    .select(['count'])
    .toPandas()
    .hist(bins=30)
)

plt.show()

#### 2. Number of users who played each artist

The distribution is right-skewed: most of the mass sits at low counts, with a long tail to the right. Only a few of the most popular artists are heard by more than 250 users, while most artists are heard by fewer than 50 users. In other words, most artists are rarely chosen by users.


In [None]:
(
    play_df
    .dropDuplicates(['user_id', 'artist_id'])
    .groupby('artist_id')
    .count()
    .select(['count'])
    .toPandas()
    .plot(kind='hist',bins=2000, xlim=(0,2000))
)

plt.show()

#### 3. Number of plays

If a user has added an artist to a playlist but rarely plays them, we can infer that the user does not like that artist.

In [None]:
under5 = (
    play_df
    .filter(play_df.plays.between(0,5))
    .groupby('plays')
    .count()
    .toPandas())

under5
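
One way to act on this observation is to drop (user, artist) pairs whose play count falls below a cutoff before building recommendations. The sketch below uses plain Python on toy rows. The cutoff of 5 mirrors the range inspected above but is an assumption, not a value this notebook commits to, and `filter_low_plays` is a hypothetical helper.

```python
MIN_PLAYS = 5  # assumed cutoff; tune after inspecting the play-count table

def filter_low_plays(rows, min_plays=MIN_PLAYS):
    """Keep only (user_id, artist_id, plays) rows with plays >= min_plays."""
    return [row for row in rows if row[2] >= min_plays]

# Toy (user_id, artist_id, plays) rows, purely illustrative
toy_plays = [
    ("u1", "a1", 120),
    ("u1", "a2", 1),
    ("u2", "a1", 5),
    ("u2", "a3", 0),
]

kept = filter_low_plays(toy_plays)
```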

### - `user_df` 

#### `gender`  data quality check

Since the gender column contains None values, those rows should be removed during data cleansing.

In [None]:
(
    user_df
    .groupby('gender')
    .count()
    .show()
)

#### `age`  data quality check

Since the age column contains -1, those rows should be removed during data cleansing. The remaining ages are divided into four groups ('<20', '20-30', '30-50', '>50').

In [None]:
ages = user_df.select('age').distinct().toPandas()
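
The age cleansing and grouping rule can be sketched as a small function. The group labels come from the text above, but the handling of the boundaries (20 and 30 fall into the higher group, 50 falls into '30-50') is an assumption, as is the hypothetical helper name `age_group`. Invalid ages such as -1 map to `None` so they can be filtered out afterwards.

```python
def age_group(age):
    """Map a raw age to one of four groups; return None for invalid ages."""
    if age is None or age < 0:  # -1 marks a missing age in the data
        return None
    if age < 20:
        return "<20"
    if age < 30:  # assumed boundary: 20 <= age < 30
        return "20-30"
    if age <= 50:  # assumed boundary: 30 <= age <= 50
        return "30-50"
    return ">50"

groups = [age_group(a) for a in (-1, 15, 20, 30, 50, 51)]
```

A function like this could be applied to the `age` column through a user-defined function, or rewritten as a chain of conditional column expressions.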

#### `country` data quality check

The data contains 239 countries, and some of them, such as Gambia, have only 3 rows. Countries with fewer than 10,000 users will be removed.

In [None]:
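# desc is needed to sort countries by user count in descending order
from pyspark.sql.functions import desc

country_df = (
    user_df
    .groupBy("country")
    .count()
    .orderBy(desc('count'))
    .toPandas()
)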

country_df

In [None]:
country_df[country_df['count']>=10000]
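
The country filter can be sketched the same way: count users per country, keep the countries at or above the threshold, then drop users from the rest. The helper `frequent_countries` and the toy data are hypothetical, and the toy threshold of 2 stands in for the 10,000 used on the real data.

```python
from collections import Counter

def frequent_countries(user_countries, min_users):
    """Return the set of countries that have at least min_users users."""
    counts = Counter(user_countries)
    return {country for country, n in counts.items() if n >= min_users}

# One entry per user (toy data); the real threshold would be 10000
toy_countries = ["germany", "germany", "germany", "gambia"]

keep = frequent_countries(toy_countries, min_users=2)
filtered = [c for c in toy_countries if c in keep]
```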

## (Step 3) Define the Data Model