## Problem Statement

When we engage with content in the modern world, corporations use recommender systems to suggest further content we might like, be it movies or books, based on the attributes of that content, with the goal of keeping customers interested. However, this may cause an echo chamber effect.

An extreme example of an echo chamber caused by a recommender system is YouTube's algorithm suggesting alt-right content, which may lead to an individual developing extremist views. This is a problem of overtraining, because what is recommended is based on the data that is input to the system.

Spotify uses a common method for producing recommendations known as collaborative filtering, which generates recommendations based on the combined preferences of the consumer requesting recommendations and those of other consumers. The underlying issue with this method is that song recommendations are based on the 'crowd'.

Furthermore, as the business model of Spotify is built so that artists are compensated by number of streams, the homogenization of new music becomes more prevalent, for example by making songs catchier and shorter.

The recommender system I would like to propose recommends songs to users based on the 'DNA' of the music they have been listening to, with the goal of potentially exposing the listener to songs from different genres and epochs. This is therefore a cluster or nearest neighbour recommendation system. It would broaden the horizons of music listeners and also bring attention to artists who create music as an art form, as opposed to chasing the highest streaming numbers.

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline

from sklearn.manifold import TSNE

plt.style.use('fivethirtyeight')

# Datasets for Artist information (from Spotify Songs Kaggle dataset)

In [2]:
artist_df = pd.read_csv('../data/spotify_songs_1922/artists.csv')
data_by_artist_df = pd.read_csv('../data/spotify_songs_1922/data_by_artist_o.csv')
tracks_df = pd.read_csv('../data/spotify_songs_1922/tracks.csv')
data_df = pd.read_csv('../data/spotify_songs_1922/data_o.csv')
data_by_year_df = pd.read_csv('../data/spotify_songs_1922/data_by_year_o.csv')
data_by_genres_df = pd.read_csv('../data/spotify_songs_1922/data_by_genres_o.csv')

FileNotFoundError: [Errno 2] No such file or directory: '../data/spotify_songs_1922/artists.csv'

## Decisions Moving Forward
- Using non-explicit songs as there are more of them
- Consider making decade a categorical feature
- Clustering genres together, as genres are not generalised


## Modelling:

### Feature Engineering

- Using K-means (or other clustering techniques) to cluster genres together
- Should also look into other clustering methods


- What if TF-IDF is used on the genre feature, considering that it is text data? TF-IDF gives more weight to words that are unique to a data point: if the word 'hiphop' is unique to a song's genre name, the model would 'cluster' songs sharing it together, as it views them as similar. Less weight is given to words that appear in many genre names. CountVectorizer would not be as useful; in terms of application, it is better suited to looking for the frequency of phrases, etc.
- In TF-IDF, the weight of a word is its frequency within a document multiplied by the log of the inverse document frequency (the number of documents divided by the number of documents that contain the word). In this context, the documents are the song genres in a given playlist.
- If a genre is 'chinese pop', the word 'pop' would not carry the most weight, but the word 'chinese' would have a higher weight. This could potentially be metadata of its own.
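
The 'chinese pop' idea above can be checked with a small sketch. It assumes scikit-learn's `TfidfVectorizer` with its default settings, and the genre names are hypothetical stand-ins for the genres in a playlist, not values taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical genre names standing in for the genres of songs in a playlist
genres = ["chinese pop", "korean pop", "dance pop", "indie pop", "chinese rock"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(genres)  # sparse matrix, one row per genre name

# Look up the weights of each word in the first genre, "chinese pop"
vocab = vectorizer.vocabulary_  # maps word -> column index
row = tfidf[0].toarray().ravel()
chinese_weight = row[vocab["chinese"]]
pop_weight = row[vocab["pop"]]

print(f"chinese: {chinese_weight:.3f}, pop: {pop_weight:.3f}")
```

Because 'pop' appears in four of the five genre names and 'chinese' in only two, 'chinese' ends up with the larger weight within 'chinese pop'.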


- Find a way for songs not to be suggested based on what is current: let songs that are decades old be recommended. Decide how to manipulate the weight of that feature.



### Different types of Feature Selection/Extraction

- Using SVD (Singular Value Decomposition), a form of matrix decomposition
- Using PCA to extract the components that explain the most variance
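
As a rough illustration of the PCA option, the sketch below reduces six synthetic features to three components. The data is randomly generated as an assumption for the example and is not the Spotify dataset; `TruncatedSVD` from `sklearn.decomposition` could be swapped in with a similar interface.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for audio features: 200 songs, 6 features,
# where the last three features are noisy copies of the first three
base = rng.normal(size=(200, 3))
features = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

# PCA is sensitive to scale, so standardise first
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=3)
reduced = pca.fit_transform(scaled)

print(reduced.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept by 3 components
```

Note that PCA builds new components as combinations of the original features rather than picking a subset of them, so the reduced columns no longer correspond to named attributes such as 'energy'.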

### Metrics

- Unsure if the metric is subject to the listener


### Building the Recommender System
- Naive and non-personalised approach: recreating collaborative filtering as a baseline
- Using distance to build a recommender system via clustered genres
- Using a CNN as a recommender system
    - emulating this research paper https://www.sciencedirect.com/science/article/pii/S1877050919310646/pdf?md5=4f9a5242eb223b5c96c9ebf130855467&pid=1-s2.0-S1877050919310646-main.pd
- using cosine similarity
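
A minimal sketch of the cosine similarity option follows. The song names and feature values are made up for illustration, and `recommend` is a hypothetical helper, not part of any library.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical songs with made-up, already scaled audio features
song_names = ["song_a", "song_b", "song_c", "song_d"]
features = np.array([
    [0.9, 0.1, 0.8],  # song_a
    [0.8, 0.2, 0.9],  # song_b: close to song_a
    [0.1, 0.9, 0.2],  # song_c
    [0.2, 0.8, 0.1],  # song_d: close to song_c
])

def recommend(index, n=1):
    """Return the names of the n songs most similar to the song at `index`."""
    sims = cosine_similarity(features[index:index + 1], features).ravel()
    sims[index] = -np.inf  # never recommend the seed song itself
    top = np.argsort(sims)[::-1][:n]
    return [song_names[i] for i in top]

print(recommend(0))
```

Cosine similarity compares the direction of two feature vectors rather than their magnitude, which is one reason it is a common choice for content-based recommendations.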

### Potential Limitations and Considerations

- The Kaggle dataset does not have user engagement data, only genre metadata

 - Do I need to use the Million Song Dataset, as I need a rating system to work on?
     - current spotify dataset may not have enough features
     - Source of dataset: http://millionsongdataset.com/
     - This article uses it https://towardsdatascience.com/how-to-build-a-simple-song-recommender-296fcbc8c85
    
     
     
 - Show why collaborative filtering is not ideal: it suffers from the cold start problem, and suggesting niche recommendations requires reference data from what may be a scarce dataset. More generally, show why collaborative filtering would create an echo chamber: the popularity of music can be affected by exogenous factors such as hype around the artist, your social group and so on.
 
 

#### helpful research links

- https://www.nytimes.com/2009/10/18/magazine/18Pandora-t.html
- https://towardsdatascience.com/4-ways-to-supercharge-your-recommendation-system-aeac34678ce9
- https://soundcharts.com/blog/music-industry-trends#the-democratization-of-music-creation

## Data Dictionary

## Datasets Description

#### artist_df
- shows popularity
- shows number of followers
- shows genre

#### data_by_artist_df
- 28680 artists
- Gives the average of the metadata of the songs made by the artist.
- Think of it as the music DNA of the artist
- Also provides a list of the genres the artist is categorised in.
 
#### tracks_df (songs)
- 586672 songs
- Does not have genre IMPORTANT
- has repeats of the same song with the same specs, possibly rereleased.
- does not have year or release date
- has song metadata.

#### data_df (songs)
- 170653 songs
- does not have genre
- has release dates and year in the dataset
- has song metadata

#### data_by_year
- shows average of music data that year
- Useful for EDA
    - show trend of how music has evolved over the years
    - plot graph

## Check for duplicates

In [None]:
# check for Ashnikko result for the Song 'Daisy' in data_df dataset
data_df[(data_df.artists == "['Ashnikko']") & (data_df.name == "Daisy")]

In [None]:
# check for Ashnikko result for the Song 'Daisy' in tracks_df dataset
tracks_df[(tracks_df.artists == "['Ashnikko']") & (tracks_df.name == "Daisy")]

comments:
tracks_df may have duplicates; one example so far is entries that differ in release date.

Comparing the same song between data_df and tracks_df, one of the entries shares the same ID, but certain attributes have different values, such as popularity.
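
The manual lookups above check one song at a time. A possible way to list every repeated artist and song name pair in one go is sketched below, using a toy frame with made-up values in place of tracks_df.

```python
import pandas as pd

# Toy stand-in for tracks_df; the values are made up for illustration
toy_tracks = pd.DataFrame({
    "artists": ["['Ashnikko']", "['Ashnikko']", "['Drake']"],
    "name": ["Daisy", "Daisy", "Best I Ever Had"],
    "popularity": [80, 45, 70],
})

# keep=False marks every member of a repeated (artists, name) group
is_repeat = toy_tracks.duplicated(subset=["artists", "name"], keep=False)
repeats = toy_tracks[is_repeat].sort_values(["artists", "name"])
print(repeats)
```

Restricting `subset` to the artist and song name keeps rows that differ in other attributes, such as popularity, so the near-duplicates can be compared side by side.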

In [None]:
# check for Drake result for the Song 'Best I Ever Had' in data_df dataset
data_df[(data_df.artists == "['Drake']") & (data_df.name == "Best I Ever Had")]

In [None]:
# check for Drake result for the Song 'Best I Ever Had' in tracks_df dataset
tracks_df[(tracks_df.artists == "['Drake']") & (tracks_df.name == "Best I Ever Had")]

comments: For the example above, the same song returned more than one result. In data_df, the release year and date differ, as do some attributes such as 'danceability'. This may indicate that the song was released twice; perhaps the later version was remastered and released on an album, rather than just as a single like the first time. tracks_df does not have release date information.

data_df also seems not to have the non-explicit version of the song.

I will investigate one more time, with the classic song "Here Comes the Sun" by the Beatles.

In [None]:
# check for Beatles result for the Song 'Here Comes the Sun' in tracks_df dataset
tracks_df[(tracks_df.artists == "['The Beatles']") & (tracks_df.name == "Here Comes The Sun")]

In [None]:
# check for Beatles result for the Song 'Here Comes the Sun' in data_df dataset
data_df[(data_df.artists == "['The Beatles']") & (data_df.name == "Here Comes The Sun")]

comment: No song to be found, however the remastered version is available.

In [None]:
# check for Beatles result for the remastered version of 'Here Comes the Sun' in tracks_df dataset
tracks_df[(tracks_df.artists == "['The Beatles']") & (tracks_df.name == "Here Comes The Sun - Remastered 2009")]

In [None]:
# check for Beatles result for the remastered version of 'Here Comes the Sun' in data_df dataset
data_df[(data_df.artists == "['The Beatles']") & (data_df.name == "Here Comes The Sun - Remastered 2009")]

Comments: For this song, only the remastered version is available on Spotify, and the entries still appear to have different characteristics, such as popularity, duration and energy.

The difference in popularity can be explained by the song being on an album that may not have been marketed as well, or by listeners feeling that the album does not have other songs they would enjoy. For example, the song with the lower popularity score belonged to the album 'The Beatles 1967-1970', whereas the song with the higher popularity belonged to the album 'Abbey Road (Remastered)', which is a bigger compilation of classics by the band.


#### Conclusion

The dataset used for EDA will be data_by_year_df.

The dataset used for modelling will be data_df, as it has more information than tracks_df. The comparison also revealed useful information: non-exact 'duplicates' of the same song are re-releases or remasters. This may affect their popularity, but moving forward, I believe that the granularity of the popularity does not matter, and I will therefore bucket it, making this feature categorical.
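
One way to bucket popularity is sketched here with `pd.cut`. The bucket edges and labels are assumptions chosen for illustration, and popularity is assumed to range from 0 to 100.

```python
import pandas as pd

popularity = pd.Series([0, 12, 35, 50, 67, 88, 100], name="popularity")

# Assumed bucket edges and labels; each bucket includes its right edge
bins = [0, 25, 50, 75, 100]
labels = ["low", "medium", "high", "very high"]

# include_lowest=True makes sure a popularity of exactly 0 falls into "low"
popularity_bucket = pd.cut(popularity, bins=bins, labels=labels, include_lowest=True)
print(popularity_bucket.tolist())
```

The result is an ordered categorical column, so the buckets can still be compared or sorted from low to very high.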

In terms of duplicates, one song is often the remastered version of the other. By definition, remastering music is essentially improving the quality of the original copy of a song or album: removing flaws from the music and providing a cleaner, sharper and more refined listening experience, whilst trying to bring the music up to date with current standards. After remastering, certain attributes of the song generally remain the same, such as key, time signature, duration, tempo and speechiness. Nonetheless, I will not remove these duplicates, as there are some music enthusiasts who actually prefer 'non-tampered' music.

Also, there are some artists who generally do not make explicit music, and who would therefore not have duplicates of the same song that differ only in explicitness. Removing duplicates by keeping only explicit songs would therefore result in removing many artists.


## Data Cleaning

In [None]:
## data

In [None]:
data_df.artists.head()

In [None]:
data_df.artists[0] #show artist name positioned at index 0

In [None]:
data_df.artists[0][0] #check to see if it is a list

In [None]:
data_df.loc[155464]

As we can see from above, the value is not a list; if it were, 'Sergei Rachmaninoff' would have been returned.

The artist and genre columns are not actually lists, but strings that look like lists.
Use regex to solve this problem.
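
As an alternative to regex, the strings can be parsed with `ast.literal_eval` from the standard library, which safely evaluates a string containing a Python literal. This is a swapped-in technique rather than the regex approach mentioned above, and the toy column below only mimics how the artists column appears to be stored.

```python
import ast
import pandas as pd

# Toy column mimicking the artists column: strings that look like lists
artists = pd.Series(["['Sergei Rachmaninoff', 'James Levine']", "['Ashnikko']"])

# Turn each string into a real Python list
parsed = artists.apply(ast.literal_eval)

print(artists[0][0])  # first character of the string
print(parsed[0][0])   # first element of the parsed list
```

After parsing, indexing returns whole artist names instead of single characters, and `explode` can be used to get one row per artist if needed.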

## Exploratory Data Analysis

In [None]:
data_df.info()

In [None]:
tracks_df.info()

In [None]:
artist_df.popularity.sort_values(ascending=False)

In [None]:
artist_df.iloc[144481,:]

comment: Justin Bieber is the most popular artist

In [None]:
artist_df.info()

In [None]:
data_df.shape

In [None]:
data_df.head()

In [None]:
tracks_df.shape

In [None]:
tracks_df.info()

In [None]:
data_df.info()

In [None]:
for x in data_df.columns:
    if x not in tracks_df.columns:
        print (x)

In [None]:
tracks_df.info()

### creating scaled dataframe for eda

In [None]:
lin_graph = data_by_year_df.iloc[:,2:]
ss = StandardScaler()
lin_graph_data = ss.fit_transform(lin_graph)
#lin_graph_data = pd.concat([lin_graph,data_by_year_df.year],axis=1)

In [None]:
lin_graph_data = pd.DataFrame(data = lin_graph_data, columns=data_by_year_df.iloc[:,2:].columns.values)

In [None]:
graph_data = pd.concat([lin_graph_data,data_by_year_df.year],axis=1)

In [None]:
graph_data

In [None]:
graph_data[["acousticness","danceability","energy", 
         "instrumentalness", "liveness", "valence","year"]].set_index('year').plot(kind='line',figsize=(10,10))

comments: This is general EDA and not very useful for producing meaningful insight. Moving forward, I will explore more trends related to the problem statement.

## More Ideas for EDA

- show how the duration of music has shortened over time
- show the relationship between popular music and short duration, indicative of our shortened attention span
- show how the prevalence of genres has affected our music taste
- show how genre popularity has changed

### show how duration of music has shortened over time

In [None]:
#plot the duration of songs in minutes across time
plt.figure(figsize=(8,8))
(data_by_year_df.groupby('year')['duration_ms'].mean()/60000).plot() #group data by year, take the average duration per year and plot
plt.title('Duration of Music Declines Over Time')
plt.savefig('../iamges/duration-of-music-decline-overtime.png');

In [None]:
(data_by_year_df.groupby('year')['duration_ms'].mean()/60000)

In [None]:
#plot the relationship between song duration and popularity
plt.figure(figsize=(8,8))
sns.scatterplot(data =(data_df.groupby('popularity')['duration_ms'].mean()/60000))
plt.title('Songs Rated Above 50 Get More Popular as Song Duration Decreases')
plt.ylabel('song duration (minutes)');

The shorter duration of songs suggests that artists are incentivised to make shorter songs for more plays, as they are paid per stream. It can also reflect music listeners' shortened attention spans.

Furthermore, past the 50 mark, songs tend to be more popular when they are shorter.

In [None]:
plt.figure(figsize=(8,8))
data_by_year_df.groupby('year')['loudness'].mean().plot() #group data by year, take the average loudness per year and plot
plt.title('Loudness Increases Over Time');


- Loudness is the inherent volume of the music itself, before any adjustments by the listener.
- As loudness increases, the dynamic range becomes much more restricted: the contrast between the really soft parts and the really loud parts shrinks, so the overall emotional impact of the music is reduced.

In [None]:
data_by_year_df.key.unique()

In [None]:
data_df.info()

In [None]:
#convert key data into string
data_df.key = data_df.key.astype(str)

In [None]:
data_df.groupby('year')['key'].unique()

In [None]:
#show value count of keys in proportion to overall count for spread of data
data_df[data_df['year'] == 2020]['key'].value_counts(normalize =True)

In [None]:
sample = data_df[data_df['year'] == 2020]['key'].value_counts(normalize =True)

In [None]:
len(sample)

In [None]:
#calculate the variance to return a single value
((sample - sample.mean())**2).sum()/(len(sample)-1)

In [None]:
#test for a different year
sample2 = data_df[data_df['year'] == 2005]['key'].value_counts(normalize =True)

In [None]:
((sample2 - sample2.mean())**2).sum()/(len(sample2)-1)

In [None]:
#list of variance per year
variance = []

for year in data_df.year.unique():
    #create normalized value count
    normal_data = data_df[data_df['year'] == year]['key'].value_counts(normalize =True)
    # calculate variance
    variance_val = ((normal_data - normal_data.mean())**2).sum()/(len(normal_data)-1)
    #append variance to list
    variance.append(variance_val)
        

In [None]:
plt.figure(figsize=(8,8))
plt.title('Variance of Music (key) Decreasing Over Time')
sns.lineplot(y=variance, x=data_df.year.unique());
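
The per-year loop above can also be written as a single groupby. The sketch below uses a toy frame with made-up values in place of data_df, and `key_share_variance` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for data_df with only the two columns that are needed
toy = pd.DataFrame({
    "year": [2005, 2005, 2005, 2005, 2020, 2020, 2020, 2020],
    "key":  ["0",  "0",  "0",  "7",  "0",  "2",  "5",  "7"],
})

def key_share_variance(keys):
    """Sample variance of the share of songs in each key."""
    shares = keys.value_counts(normalize=True)
    return shares.var()  # pandas uses ddof=1, matching the manual formula

variance_by_year = toy.groupby("year")["key"].apply(key_share_variance)
print(variance_by_year)
```

When reading the result, note that a variance near zero means the keys are used about equally often in that year, while a higher value means the songs are concentrated in fewer keys.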

Key or pitch includes details about harmony, melody, chords, and progressions: essentially how the notes were arranged and unfolded over the course of the song.

The data suggests that the variety of pitch progressions used has shrunk over the years. In other words, musicians are becoming less inventive and adventurous in how they get from one note or chord to the next, and instead seem to be relying more and more on the same sequences and patterns that others have used successfully in the past.

### summary of EDA thus far

#### music becoming more cookie cutter

-  pitch
    - the data suggested that the variety of pitch progressions used has shrunk over the years
    

- music getting louder 
    - everything is getting louder. Which might not seem like a big deal (just turn down the volume knob, right?), until you start to notice that when everything is louder, the dynamic range becomes much more restricted. As in, the contrast between the really soft stuff, and the really loud stuff shrinks, so the overall emotional impact of the music is reduced. 

### show the most popular month that music is released

In [None]:
data_df['month'] = pd.DatetimeIndex(data_df['release_date']).month

In [None]:
data_df.release_date.head(1000)

In [None]:
len(data_df.loc[999]['release_date'])

In [None]:
len(data_df.loc[996]['release_date'])


In [None]:
data_df.info()

In [None]:
data_df['month'] = None

In [None]:

# look up column positions by name instead of hard-coding them
date_col = data_df.columns.get_loc('release_date')
month_col = data_df.columns.get_loc('month')

for row in range(len(data_df)):
    date = data_df.iat[row, date_col]
    if len(date) == 4:  # year only: no month available
        data_df.iat[row, month_col] = np.nan
    elif len(date) == 7:  # year and month
        data_df.iat[row, month_col] = datetime.datetime.strptime(date, "%Y-%m").month
    else:  # full date
        data_df.iat[row, month_col] = datetime.datetime.strptime(date, "%Y-%m-%d").month


In [None]:
data_df.month.value_counts()
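
The same month extraction can be done without a row-by-row loop. The sketch below uses pandas string methods on a toy series with made-up dates in the three formats handled above (year only, year and month, full date).

```python
import pandas as pd

# Toy release dates in the three formats found in release_date
release_date = pd.Series(["1921", "1921-03", "1921-03-15", "2020-11-27"])

# The month is the second '-'-separated part; year-only dates have none and become NaN
month = pd.to_numeric(release_date.str.split("-").str[1])
print(month.tolist())
```

Keeping year-only dates as NaN avoids counting them as January, which would otherwise inflate the count for the first month.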

In [None]:
plt.figure(figsize=(8,8))
sns.lineplot(data=data_df.month.value_counts())
plt.title('Most Music Is Released in January');

Generally, the first two months of the year are a great time to release new music. Why? The market isn't as saturated as later on in the year, and the minds of listeners are open to new things.

## Clustering Genres Together

In this section, data_by_genres_df is used to see how the different genres are clustered together based on their average song attributes, using different clustering methods.

By using this dataset, we can see the different kinds of genre you could expect from a given cluster.

In [None]:
data_by_genres_df.sort_values(by='popularity',ascending=False).head(20)

In [None]:
data_by_genres_df["genres"].nunique()

In [None]:
data_by_genres_df.info()

In [None]:
data_by_genres_df.select_dtypes(np.number).head()

### Finding optimal K using elbow method

In [None]:
ss = StandardScaler()

# assign numerical data as train data
X = data_by_genres_df.select_dtypes(np.number)
X_scaled = ss.fit_transform(X)

In [None]:

K = range(1,10)
distortion = [] #the sum of square error for each data point to their nearest cluster centre
# using for loop
for k in K:
    inertia = KMeans(n_clusters=k).fit(X_scaled).inertia_
    distortion.append(inertia)



In [None]:
#plot distortion against the range of k (1 to 9) to select the best K

plt.plot(K, distortion,'bx-')

In [None]:
KMeans(n_clusters=20).fit(X_scaled).inertia_


I won't be using the elbow method, as 2 clusters for genres of music is not logical. I will set my own K value, or look into other clustering methods that make more sense than just eyeballing it.
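
One option besides eyeballing the elbow is the silhouette score, which rates how well separated the clusters are for each candidate K. This is a sketch of a different technique from the one used in this notebook; the data is generated with `make_blobs` as a stand-in for X_scaled, and the cluster centres are assumptions for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, standing in for X_scaled
X_demo, _ = make_blobs(n_samples=300, centers=[(-5, -5), (0, 5), (5, -5)],
                       cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The K with the highest score gives the most clearly separated clusters
best_k = max(scores, key=scores.get)
print(best_k)
```

On real genre data the scores may be much closer together than in this toy example, so the result is better treated as a guide than as a definitive answer.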

In [None]:
#reassign KMeans' prediction
km_model = KMeans(n_clusters=30).fit(X_scaled)
grouped_genre = km_model.predict(X_scaled)

In [None]:
data_by_genres_df['grouped_genre'] = grouped_genre

In [None]:
data_by_genres_df.grouped_genre.value_counts()

In [None]:
data_by_genres_df[(data_by_genres_df.grouped_genre == 4)].head(30)

In [None]:
data_by_genres_df[(data_by_genres_df.genres == 'aggrotech')]

In [None]:
def scaler(data):
    ss = StandardScaler()
    return ss.fit_transform(data)

## Visualise Clustering Segmentation

What is t-SNE?

It is used to understand high-dimensional data by projecting it into a low-dimensional space (here, 2D).

It can also be used with a CNN.

In [None]:
# initialise TSNE
tsne = TSNE(n_components=2) #reduce to two dimensions
genre_embedding = tsne.fit_transform(X_scaled)
projection = pd.DataFrame(columns=['x','y'], data= genre_embedding) #put array into dataframe
projection['genre'] = data_by_genres_df['genres'] #assign data points original genres
projection['grouped_genre'] = data_by_genres_df['grouped_genre'] #assign data points cluster

In [None]:
def tsne_plot(data,title):
    plt.figure(figsize=(10,10))
    plt.title(title)
    plt.scatter(x='x', y='y', data=data, c ='grouped_genre', cmap='Dark2')

In [None]:
tsne_plot(projection,'Clusters with KMeans Algo where K=30')

Visually, the segmentation of the data does not appear to be the best. I may have to reconsider the clustering algorithm or the number of clusters.

# Using Spotify API

## What I've learnt using the API



## Feature Selection


## Building Recommender

### Content-based filtering 

#### Advantages

- The model doesn't need any data about other users, since the recommendations are specific to this user. This makes it easier to scale to a large number of users.
- The model can capture the specific interests of a user, and can recommend niche items that very few other users are interested in.

#### Disadvantages

- Since the feature representation of the items is hand-engineered to some extent, this technique requires a lot of domain knowledge. Therefore, the model can only be as good as the hand-engineered features.
- The model can only make recommendations based on existing interests of the user. In other words, the model has limited ability to expand on the users' existing interests.
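
A content-based recommender along these lines can be sketched with a taste profile and a nearest neighbour search. The catalogue values and the list of listened songs are made up for illustration; only the user's own listening history is needed, which matches the first advantage listed above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical catalogue: each row is a song, each column a scaled audio feature
catalogue = np.array([
    [0.90, 0.10],  # song 0
    [0.80, 0.20],  # song 1
    [0.85, 0.15],  # song 2
    [0.10, 0.90],  # song 3
    [0.20, 0.80],  # song 4
])
listened = [0, 1]  # indices of songs the user has already played

# The user's taste profile is the mean feature vector of the listened songs
profile = catalogue[listened].mean(axis=0, keepdims=True)

# Rank the whole catalogue by distance to the profile
nn = NearestNeighbors(n_neighbors=len(catalogue)).fit(catalogue)
_, order = nn.kneighbors(profile)

# Walk the neighbours from nearest to farthest, skipping songs already heard
recommendations = [int(i) for i in order[0] if i not in listened][:2]
print(recommendations)
```

The second recommendation here is already far from the profile, which illustrates the limitation noted above: the further down the ranking, the less the suggestion reflects the user's existing interests.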

## Recommendations

Implicit
- Get implicit feedback from the number of times a user skips a recommended suggestion (not the most accurate, as clicks can happen by accident)
- The ratio of songs added to the playlist against songs suggested
- How long the user listens to the song
- How often the song is repeated, provided it has been added to the playlist


Explicit
- Ask the user for explicit feedback on whether they like the recommendation (perceived quality)
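
The implicit signals listed above could be computed from an interaction log along the following lines. The log and all of its column names are hypothetical, since neither the Kaggle dataset nor this notebook contains user engagement data.

```python
import pandas as pd

# Hypothetical interaction log: one row per recommended song shown to a user
log = pd.DataFrame({
    "skipped":           [True, False, False, True, False],
    "added_to_playlist": [False, True, False, False, True],
    "seconds_played":    [5, 180, 60, 3, 200],
    "song_seconds":      [200, 180, 240, 210, 200],
})

skip_rate = log["skipped"].mean()            # share of recommendations skipped
add_rate = log["added_to_playlist"].mean()   # songs added / songs suggested
# average fraction of each song that was actually listened to
listen_fraction = (log["seconds_played"] / log["song_seconds"]).mean()

print(skip_rate, add_rate, round(listen_fraction, 3))
```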