### Aim and Motivation
We consider this as a proposal for new data science project which we believe will be important for our company to do well in the market. 
We define precisely in the report and briefly in the cells below, what is the business problem that the team wants to solve.
learn about the data science related business and research about the current markets: such as search, social media, advertisement, recommendation and so on. 
Why the problem is important to solve? 
Why we believe our team can make use of data science methodologies to solve the problem?
How are we planing to persuade the senior executives to buy in your idea.

**These are some of the business problems that we intend to solve through our project idea.**

1) Business problem to solve: Song streaming platforms such as Spotify and Apple Music all have their own song recommendation algorithms. They make use of the massive amounts of data that users provide, in addition to various data and information from the songs themselves.  There are many users who feel that there are a lack of playlists curated for specific ‘moods’ and ‘activites’. They may feel that there are not enough options, or that the songs being recommended to them aren’t exactly what they want. We want to provide this service and capture this share of the market. We believe that this particular feature isn’t being offered to a high level of quality in the market.


2) Why the problem is important to solve? - The problem is important to solve because we believe that this is a niche in the market that hasn’t quite been tapped into. This sort of convenience and specificity has never been provided to users before and doing so would result in increased revenue and the ability to carve a name for the company in an underdeveloped area. 


3) What is the idea to solve the problem? - Our idea to solve this problem is to train a model that focuses only on identifying what makes a song similar to another, and then recommends such songs to the user. Given that the user would define an “activity” and then offer a list of 20+ songs to the model, it would be able to generate multiply playlists for this user defined activity, all of which would be to the user’s tastes and similar to the user uploaded playlist


4) What differences we could make with our data science approach? - Companies like Spotify recommend songs to users based a combination of the following factors: lyrical content, language and song features, and past listening habits. This can be good for some purposes, but we feel that it lacks the specificity some users look for. There is a feature where spotify creates a ‘radio’ built around the songs you have in your playlists, but this is not a convenient set of playlists that you can share, listen to, and add to. Users would often times want activity or mood specific playlists curated based on their unique musical preferences. We believe that with our approach we take a step closer to meeting this type of need. 


5) Why do we believe the idea deserves the financial resources of a company? - If the company allocates resources into developing this model and building a product around it, we predict that there would great profits owing to increased user engagement and satisfaction. We believe that the feature we are discussing is not being offered in the market currently. In addition, the song recommendation systems that do exist work off of slightly different mechanisms. If the company invests in our idea, we believe that we would be able to spearhead a new kind of recommendation system, one where the user is in power, and where the specificity levels are higher than are currently available.




### Data Gathering, Exploratory Data Analysis, Make Conjectures

Gathering data, using appropriate data analysis techniques to explore the data. Making conejectures in relation the above problems. Finding support for our conjectures using the data.

In [467]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

In [468]:
import pandas as pd
cid = 'b66cd474c4914a779d3b8a275743308d'
secret = '3a4b5c746e3043a4aea8ce5752cbe7ec'
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [469]:
playlist_link = "https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF?si=1333723a6eff4b7f"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]
track_uris = [x["track"]["uri"] for x in sp.playlist_tracks(playlist_URI)["items"]]

In [470]:
#importing regular expression library
import re

In [471]:
df["track_uri"] = df["track_uri"].apply(lambda x: re.findall(r'\w+$', x)[0])
df["track_uri"]

0        0UaMYEvWZi0ZqiDOoHU3YI
1        6I9VzXrHxO9rA9A5euc8Ak
2        0WqIKmW4BTrj3eJFmnCKMv
3        1AWQoqb9bSvzTjaLralEkT
4        1lzr43nnXAijIGYnCT8M8H
                  ...          
67498    5uCax9HTNlzGybIStD3vDh
67499    0P1oO2gREMYUCoOkzYAyFu
67500    2oM4BuruDnEvk59IvIXCwn
67501    4Ri5TTUgjM96tbQZd5Ua7V
67502    5RVuBrXVLptAEbGJdSDzL5
Name: track_uri, Length: 67503, dtype: object

In [472]:
def get_playlist_tracks(username,playlist_id):
    results = sp.user_playlist_tracks(username,playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks

In [473]:
all_tracks = get_playlist_tracks('5catacombs', 'https://open.spotify.com/playlist/37i9dQZEVXbNG2KDcFcKOF?si=1333723a6eff4b7f')
all_track_ids = []
for i in all_tracks:
    all_track_ids.append(i['track']['id']) 

In [474]:
def get_track_features(id):
    metadata = sp.track(id)
    features = sp.audio_features(id)

    # metadata
    name = metadata['name']
    album = metadata['album']['name']
    artist = metadata['album']['artists'][0]['name']
    release_date = metadata['album']['release_date']
    duration = metadata['duration_ms']
    popularity = metadata['popularity']
    #genre = artist_info["genres"]

    # audio features
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']

    track = [popularity, danceability, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature]
    return track

In [475]:
tracks = []
for i in range(len(all_track_ids)):
  track = get_track_features(all_track_ids[i])
  tracks.append(track)
df2 = pd.DataFrame(tracks, columns = ['popularity', 'danceability', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature'])
df2.head()

Unnamed: 0,popularity,danceability,acousticness,danceability.1,energy,instrumentalness,liveness,loudness,speechiness,tempo,time_signature
0,100,0.714,0.013,0.714,0.472,5e-06,0.266,-7.375,0.0864,131.121,4
1,97,0.637,0.13,0.637,0.643,2e-06,0.142,-6.571,0.0519,97.008,4
2,92,0.52,0.342,0.52,0.731,0.00101,0.311,-5.338,0.0557,173.93,4
3,97,0.835,0.583,0.835,0.679,2e-06,0.218,-5.329,0.0364,124.98,4
4,94,0.336,0.164,0.336,0.627,0.0,0.0708,-7.463,0.0384,150.273,4


### Model Development

In [1]:
import pandas as pd

In [2]:
# Import processed data
playlist_df = pd.read_csv("processed_data.csv")
print(playlist_df.columns)
playlist_df.drop(columns=["Unnamed: 0",'Unnamed: 0.1'], inplace = True)
playlist_df.head()

Index(['Unnamed: 0.1', 'Unnamed: 0', 'pos', 'artist_name', 'track_uri',
       'artist_uri', 'track_name', 'album_uri', 'duration_ms_x', 'album_name',
       'name', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url',
       'duration_ms_y', 'time_signature', 'artist_pop', 'genres', 'track_pop'],
      dtype='object')


Unnamed: 0,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,danceability,...,type,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop
0,0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Throwbacks,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
1,73,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,w o r k o u t,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
2,14,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,party playlist,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
3,42,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,Dance mix,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
4,1,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,spotify:artist:2wIVse2owClT7go1WT98tk,Lose Control (feat. Ciara & Fat Man Scoop),spotify:album:6vV5UrXcfyQD1wu4Qo2I9K,226863,The Cookbook,spin,0.904,...,audio_features,0UaMYEvWZi0ZqiDOoHU3YI,spotify:track:0UaMYEvWZi0ZqiDOoHU3YI,https://api.spotify.com/v1/tracks/0UaMYEvWZi0Z...,https://api.spotify.com/v1/audio-analysis/0UaM...,226864,4,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69


### Data Preprocessing

In [478]:
#finding the duplicates (artist name, track name and playlist name)
playlist_df[['artist_name','track_name','name']]

Unnamed: 0,artist_name,track_name,name
0,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),Throwbacks
1,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),w o r k o u t
2,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),party playlist
3,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),Dance mix
4,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop),spin
...,...,...,...
67494,Jon D,I Don't Know,thinking of you
67495,Big Words,The Answer,thinking of you
67496,Allan Rayman,25.22,thinking of you
67497,Jon Jason,Good Feeling,thinking of you


In [479]:
# Drop song duplicates
def drop_duplicates(df):
    df['artists_song'] = df.apply(lambda row: row['artist_name']+row['track_name'],axis = 1)
    return df.drop_duplicates('artists_song')

songs_df = drop_duplicates(playlist_df)

Creating the dataset out of all Spotify features/properties/moods of the music that would be required for the model training.

In [480]:
# Select useful columns
def cols(df):
       return df[['artist_name','id','track_name','danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', "artist_pop", "genres", "track_pop"]]
songs_df = select_cols(songs_df)
songs_df.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop
0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,Lose Control (feat. Ciara & Fat Man Scoop),0.904,0.813,4,-7.105,0,0.121,0.0311,0.00697,0.0471,0.81,125.461,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69
6,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,Toxic,0.774,0.838,5,-3.914,0,0.114,0.0249,0.025,0.242,0.924,143.04,84,dance_pop pop post-teen_pop,83
19,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,Crazy In Love,0.664,0.758,2,-6.583,0,0.21,0.00238,0.0,0.0598,0.701,99.259,86,dance_pop pop r&b,25
46,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,Rock Your Body,0.892,0.714,4,-6.055,0,0.141,0.201,0.000234,0.0521,0.817,100.972,82,dance_pop pop,79
55,Shaggy,1lzr43nnXAijIGYnCT8M8H,It Wasn't Me,0.853,0.606,0,-4.596,1,0.0713,0.0561,0.0,0.313,0.654,94.759,75,pop_rap reggae_fusion,2


In [481]:
#converting genres back into lists so that they can be accessed while model training
def genre_to_list(df):
    df['genres_list'] = df['genres'].apply(lambda x: x.split(" "))
    return df
songs_df = genre_preprocess(songs_df)
songs_df['genres_list'].head()


0     [dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...
6                       [dance_pop, pop, post-teen_pop]
19                                [dance_pop, pop, r&b]
46                                     [dance_pop, pop]
55                             [pop_rap, reggae_fusion]
Name: genres_list, dtype: object

In [482]:
#returning the final dataset after going through the above steps
def playlist_preprocess(df):

    df = drop_duplicates(df)
    df = cols(df)
    df = genre_to_list(df)

    return df

### Feature Engineering
Here we will be performing the following:
1. Sentiment Analysis
2. One-hot encoding
3. TF-IDF
4. Normalization

In [483]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

#### Sentiment Analysis
Here, we will be computing two parameters of sentiment of users - Subjectivity (personal opinion) and Polarity (degree of the sentiment).
Subjecticity varies from 0 to 1, where Polarity varies from -1 to 1.

In [484]:
def get_score(score, task = "polarity"):
  if task == "subjectivity":
    if score < 1/3: 
        return "low"
    elif score > 1/3:
        return "high"
    else:
        return "medium"
  else:
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'
def sentiment_analysis(df, text_col):
    df['subjectivity'] = df[text_col].apply(lambda x: TextBlob(x).sentiment.subjectivity).apply(lambda x: get_score(x,"subjectivity"))
    df['polarity'] = df[text_col].apply(lambda x: TextBlob(x).sentiment.polarity).apply(get_score)
    return df

In [485]:
sentiment_df = sentiment_analysis(songs_df, "track_name")
sentiment_df.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop,genres_list,subjectivity,polarity
0,Missy Elliott,0UaMYEvWZi0ZqiDOoHU3YI,Lose Control (feat. Ciara & Fat Man Scoop),0.904,0.813,4,-7.105,0,0.121,0.0311,0.00697,0.0471,0.81,125.461,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,69,"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...",low,Neutral
6,Britney Spears,6I9VzXrHxO9rA9A5euc8Ak,Toxic,0.774,0.838,5,-3.914,0,0.114,0.0249,0.025,0.242,0.924,143.04,84,dance_pop pop post-teen_pop,83,"[dance_pop, pop, post-teen_pop]",low,Neutral
19,Beyoncé,0WqIKmW4BTrj3eJFmnCKMv,Crazy In Love,0.664,0.758,2,-6.583,0,0.21,0.00238,0.0,0.0598,0.701,99.259,86,dance_pop pop r&b,25,"[dance_pop, pop, r&b]",high,Negative
46,Justin Timberlake,1AWQoqb9bSvzTjaLralEkT,Rock Your Body,0.892,0.714,4,-6.055,0,0.141,0.201,0.000234,0.0521,0.817,100.972,82,dance_pop pop,79,"[dance_pop, pop]",low,Neutral
55,Shaggy,1lzr43nnXAijIGYnCT8M8H,It Wasn't Me,0.853,0.606,0,-4.596,1,0.0713,0.0561,0.0,0.313,0.654,94.759,75,pop_rap reggae_fusion,2,"[pop_rap, reggae_fusion]",low,Neutral


#### One-hot encoding
To convert categorical data representation into numerical format

In [486]:
def ohe(df, column, new_name): 
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df

In [487]:
# One-hot encoding for the subjectivity 
subject_ohe = ohe(sentiment, 'subjectivity','subject')
subject_ohe.iloc[0].head()

subject|0.0                    1
subject|0.03333333333333333    0
subject|0.05                   0
subject|0.05555555555555556    0
subject|0.0625                 0
Name: 0, dtype: uint8

#### TF-IDF
Using tf-idf to determine the importance of a word in the corpus of all the available songs.

In [488]:
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songs_df['genres_list'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names()]
genre_df.drop(columns='genre|unknown')
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0].head()



genre|21st_century_classical    0.0
genre|432hz                     0.0
genre|_hip_hop                  0.0
genre|_roll                     0.0
genre|a_cappella                0.0
Name: 0, dtype: float64

#### Normalization
This would the last step in feature engineering, where we normalize few of the variables between 0 and 1 that would be useful for cosine similarity compuatation ahead.

In [489]:
scaler = MinMaxScaler()
pop = songs_df[["artist_pop"]].reset_index(drop = True)
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.74
1,0.84
2,0.86
3,0.82
4,0.75


#### Feature Generation
At the end, we form a new dataframe with the help of parameters computed above under feature engineering.

In [490]:
def create_feature_set(df, float_cols):
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['genres_list'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names()]
    genre_df.drop(columns='genre|unknown') # drop unknown genre
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "track_name")

    # One-hot Encoding
    subject_ohe = ohe(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe(df, 'polarity','polar') * 0.5
    key_ohe = ohe(df, 'key','key') * 0.5
    mode_ohe = ohe(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concanenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [491]:
# Save the data and generate the features
float_cols = songs_df.dtypes[songs_df.dtypes == 'float64'].index.values
songs_df.to_csv("allsong_data.csv", index = False)

# Generate features
complete_feature_set = create_feature_set(songs_df, float_cols=float_cols)
complete_feature_set.to_csv("complete_feature.csv", index = False)
complete_feature_set.head()




Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0UaMYEvWZi0ZqiDOoHU3YI
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H


In [492]:
playlistDF_test = playlist_df[playlist_df['name']=="spin"]
playlistDF_test.head()
playlistDF_test.to_csv("test_playlist.csv")

In [493]:
def generate_playlist_feature(complete_feature_set, playlist_df):
    # Find song features in the playlist
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    # Find all non-playlist song features
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
    return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist


In [494]:
# Generate the features
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlistDF_test)

In [495]:
# Non-playlist features
complete_feature_set_nonplaylist.head()

Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,6I9VzXrHxO9rA9A5euc8Ak
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0WqIKmW4BTrj3eJFmnCKMv
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,1AWQoqb9bSvzTjaLralEkT
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,1lzr43nnXAijIGYnCT8M8H
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0XUfyU2QviPAs6bxSpXYG4


In [496]:
# Summarized playlist features
complete_feature_set_playlist_vector

genre|21st_century_classical    0.0
genre|432hz                     0.0
genre|_hip_hop                  0.0
genre|_roll                     0.0
genre|a_cappella                0.0
                               ... 
key|9                           1.0
key|10                          1.0
key|11                          0.0
mode|0                          4.0
mode|1                          3.0
Length: 2178, dtype: float64

#### Cosine Similarity
Identifying musical similarities between the vector of the distilled playlist and all other tracks. Cosine similarity is one of the most often used similarity metrics out of the many that exist.

In [497]:
def generate_playlist_recos(df, features, nonplaylist_features):
    
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)]
    # Find cosine similarity between the playlist and the complete song set
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    
    return non_playlist_df_top_40

In [498]:
# Genreate top 10 recommendation
recommend = generate_playlist_recos(songs_df, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
recommend.head(10)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]


Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,artist_pop,genres,track_pop,genres_list,subjectivity,polarity,sim
56899,T-Pain,0v4150cKFLKaB56wSohF49,I'm Sprung 2 Featuring Trick Daddy and YoungBl...,0.755,0.473,0,-9.7,0,0.342,0.0883,...,0.189,0.41,100.029,80,dance_pop hip_hop pop pop_rap r&b rap southern...,0,"[dance_pop, hip_hop, pop, pop_rap, r&b, rap, s...",low,Neutral,0.717937
39765,T-Pain,4hBP52kkuHWfdVJ2a1o2aK,I'm Sprung - German Remix Featuring Kool Savas,0.547,0.475,0,-10.77,0,0.183,0.358,...,0.198,0.357,99.464,80,dance_pop hip_hop pop pop_rap r&b rap southern...,0,"[dance_pop, hip_hop, pop, pop_rap, r&b, rap, s...",low,Neutral,0.713074
4955,T-Pain,4cM0qNo02rp56x7c6nlx6T,F.B.G.M.,0.682,0.657,10,-3.632,0,0.0447,0.123,...,0.166,0.215,77.5,80,dance_pop hip_hop pop pop_rap r&b rap southern...,57,"[dance_pop, hip_hop, pop, pop_rap, r&b, rap, s...",low,Neutral,0.710938
63067,Missy Elliott,7E3HNs4hbipzo15UqzocBO,Watcha Gonna Do (feat. Timbaland) - featuring ...,0.686,0.764,10,-8.336,0,0.248,0.209,...,0.294,0.762,185.8,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,38,"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...",low,Neutral,0.710192
63064,Missy Elliott,53cVCGWe27PNZfb0KNUnjt,Take Away (feat. Ginuwine) - featuring Ginuwin...,0.835,0.432,1,-9.311,0,0.06,0.428,...,0.0508,0.661,125.852,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,46,"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...",low,Neutral,0.703136
36540,Chris Brown,5Y2IpZl9HQ52P1WfrQy0Cr,Wait For You,0.663,0.705,0,-6.282,0,0.041,0.0135,...,0.391,0.531,112.006,90,dance_pop pop pop_rap r&b rap,0,"[dance_pop, pop, pop_rap, r&b, rap]",low,Neutral,0.699677
63068,Missy Elliott,4A0Vru4hQoBenDK5yPJp9A,Step Off,0.816,0.385,10,-6.565,0,0.0896,0.111,...,0.269,0.576,104.935,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,28,"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...",low,Neutral,0.69964
58511,Chris Brown,3wrP2GVxWH8VpHPOsKhYgz,Zero,0.731,0.818,1,-4.564,0,0.0638,0.0517,...,0.0743,0.812,120.993,90,dance_pop pop pop_rap r&b rap,57,"[dance_pop, pop, pop_rap, r&b, rap]",low,Neutral,0.699544
22652,Missy Elliott,3XplJgPz8VjbDzbGwGgZdq,Get Ur Freak On,0.794,0.805,0,-6.554,1,0.23,0.538,...,0.0952,0.658,177.799,74,dance_pop hip_hop hip_pop pop pop_rap r&b rap ...,49,"[dance_pop, hip_hop, hip_pop, pop, pop_rap, r&...",low,Neutral,0.6961
54091,Chris Brown,4wM4c5ly3GR4YYgWSePlSG,Back To Sleep REMIX,0.653,0.675,9,-6.308,0,0.213,0.119,...,0.483,0.797,92.832,90,dance_pop pop pop_rap r&b rap,58,"[dance_pop, pop, pop_rap, r&b, rap]",low,Neutral,0.695609


In [500]:
recommend.to_csv("recommended_songs.csv")

In [499]:
playlistDF_test[["artist_name","track_name"]][:20]

Unnamed: 0,artist_name,track_name
4,Missy Elliott,Lose Control (feat. Ciara & Fat Man Scoop)
21822,Marc Anthony,Aguanile
23545,Drake,"Hold On, We're Going Home"
23613,Kanye West,Black Skinhead
31758,Rihanna,Where Have You Been
32029,Drake,Signs
45222,Chris Brown,Questions
45225,Debbie Deb,Lookout Weekend
45226,Lisa Lisa & Cult Jam,Can You Feel the Beat
45227,Mase,What You Want (feat. Total)
