## GitHub Classroom
GitHub project repository: https://github.com/cs418-fa24/project-check-in-team-11

## Project Introduction
Our project aims to understand Spotify's song classification and recommendation algorithm and to see whether it can be accurately recreated. By gathering songs and their audio-feature statistics from Spotify, we will determine which song attributes Spotify relies on most to classify a song's mood. We will then evaluate whether an overall mood can be determined accurately from a user's liked-songs library.

## Scope Adjustments
We originally wanted to recreate Spotify Wrapped. However, that scope was too large, and it is geared toward building a listening profile from data unrelated to the songs themselves, such as listening history, time of day, and artist preference. We pivoted to focus on song-level data: track features such as tempo, loudness, energy, and danceability.

## Data Collection and Cleaning

### Retrieve Liked Songs

In [None]:
import json

import spotipy
from spotipy.oauth2 import SpotifyOAuth

CLIENT_ID = ''
CLIENT_SECRET = ''
REDIRECT_URI = 'http://localhost:8888/callback'

moods = {
    'HAPPY': '37i9dQZF1EVJSvZp5AOML2',
    'SAD': '37i9dQZF1EIh4v230xvJvd',
    'CHILL': '37i9dQZF1EIdNTvkcjcOzJ',
    'ENERGETIC': '37i9dQZF1EIcVD7Tg8a0MY'
}

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    redirect_uri=REDIRECT_URI,
    scope="playlist-read-private user-library-read"  # now accessing private user playlists
))

sp1 = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    redirect_uri=REDIRECT_URI,
    scope="playlist-read-private user-library-read"  # now accessing private user playlists
))

# Get the user's liked songs
results = sp.current_user_saved_tracks()
liked_songs = []

while results:
    for item in results['items']:
        track = item['track']
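        features = sp1.audio_features(track['id'])[0]
        # some tracks have no audio features; skip them instead of failing on None
        if features is None:
            continue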
        liked_songs.append({
            'name': track['name'],
            'id': track['id'],
            'acousticness': features['acousticness'],
            'danceability': features['danceability'],
            'duration_ms': features['duration_ms'],
            'energy': features['energy'],
            'instrumentalness': features['instrumentalness'],
            'key': features['key'],
            'liveness': features['liveness'],
            'loudness': features['loudness'],
            'mode': features['mode'],
            'speechiness': features['speechiness'],
            'tempo': features['tempo'],
            'time_signature': features['time_signature'],
            'valence': features['valence']
        })

    results = sp.next(results)

#TODO rename the file so that it does not overwrite anyone else's
with open('/Users/conrad/dev-school/418/final-project/raw/liked_songs_1.json', 'w') as json_file:
    json.dump(liked_songs, json_file, indent=4)


Max Retries reached


SpotifyException: http status: 429, code:-1 - /v1/audio-features/?ids=21hsqOOUfdSjHi3SVz8oyv:
 Max Retries, reason: too many 429 error responses
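
The 429 error above follows from requesting audio features one track at a time. Below is a minimal sketch of batching those requests, assuming the client's `audio_features` method accepts a list of track IDs (spotipy's does, and Spotify documents a limit of 100 IDs per request) and returns results in request order, with `None` for tracks that have no features. The names `chunked`, `fetch_features_batched`, and `pause_s` are hypothetical helpers, not part of spotipy.

```python
import time
from typing import Iterator


def chunked(ids: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_features_batched(client, track_ids, batch_size=100, pause_s=0.0):
    """Return {track_id: features} using one request per batch of IDs.

    `client` is any object with an `audio_features(list_of_ids)` method,
    such as the `sp` client above. Tracks whose features are None are skipped.
    """
    features_by_id = {}
    for batch in chunked(list(track_ids), batch_size):
        # assumption: results come back in the same order as the requested IDs
        for track_id, features in zip(batch, client.audio_features(batch)):
            if features is not None:
                features_by_id[track_id] = features
        time.sleep(pause_s)  # optional pause between requests
    return features_by_id
```

With this shape, a library of 2,284 liked songs would need 23 feature requests instead of 2,284. One way to use it is to collect all track IDs from `current_user_saved_tracks()` first and then call `fetch_features_batched(sp, ids)` once.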

### Retrieve Spotify-generated Playlists for Each Mood (happy, sad, energetic, chill)

In [None]:
### Steps to get playlists ready to pull
# 1.) Find your mix playlists for each mood (happy, sad, energetic, chill)
# 2.) Click on the "..." and add to another playlist and create a new one. Spotify will create a default name "<mood> Mix (2)"
# 3.) Once you repeat this for all the moods, you are ready to use this script

import json
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# TODO insert info same as library.py...
CLIENT_ID = ''
CLIENT_SECRET = ''
REDIRECT_URI = 'http://localhost:8888/callback'

moods = {
    'HAPPY': '37i9dQZF1EVJSvZp5AOML2',
    'SAD': '37i9dQZF1EIh4v230xvJvd',
    'CHILL': '37i9dQZF1EIdNTvkcjcOzJ',
    'ENERGETIC': '37i9dQZF1EIcVD7Tg8a0MY'
}

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    redirect_uri=REDIRECT_URI,
    scope="playlist-read-private"  # now accessing private user playlists
))

sp1 = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    redirect_uri=REDIRECT_URI,
    scope="playlist-read-private"  # now accessing private user playlists
))

for mood, p_id in moods.items():
    results = sp.playlist_items(p_id)
    tracks = []

    while results:
        for item in results['items']:
            track = item['track']
            features = sp1.audio_features(track['id'])[0]

            if features is None:
                continue

            tracks.append({
                'name': track['name'],
                'id': track['id'],
                'acousticness': features['acousticness'],
                'danceability': features['danceability'],
                'duration_ms': features['duration_ms'],
                'energy': features['energy'],
                'instrumentalness': features['instrumentalness'],
                'key': features['key'],
                'liveness': features['liveness'],
                'loudness': features['loudness'],
                'mode': features['mode'],
                'speechiness': features['speechiness'],
                'tempo': features['tempo'],
                'time_signature': features['time_signature'],
                'valence': features['valence']
            })
        # get next set of tracks
        results = sp.next(results)
    # all pages for this mood have been read
    print(f"{mood} complete")

    #TODO make sure to enter the number corresponding to your data
    num = 1
    # the with block closes the file on exit, so no explicit close() is needed
    with open(f'spotify_{mood.lower()}_{num}.json', 'w') as file:
        json.dump(tracks, file, indent=4)


### Import into Pandas Dataframe

In [2]:
import json
import pandas as pd

moods = ['happy', 'sad', 'chill', 'energetic']
dfs = []
for mood in moods:
    files = [
        f'./raw/spotify_{mood}_1.json',
        f'./raw/spotify_{mood}_2.json',
        f'./raw/spotify_{mood}_3.json',
        f'./raw/spotify_{mood}_4.json',
        f'./raw/spotify_{mood}_5.json',
        f'./raw/spotify_{mood}_6.json',
    ]

    for file in files:
        with open(file, 'r') as fileio:
            df = pd.DataFrame(json.load(fileio))
            df['mood'] = mood
            dfs.append(df)


final_df = pd.concat(dfs, ignore_index=True)

#processing
drop = ['name', 'id']
final_df = final_df.drop(columns=drop)



FileNotFoundError: [Errno 2] No such file or directory: '../raw/spotify_happy_1.json'
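
The `FileNotFoundError` above is raised as soon as one hard-coded file is missing. A minimal sketch of a more forgiving loader follows: it globs for whatever `spotify_<mood>_<n>.json` files exist in a directory and reports the resolved path when none are found. The function name `load_mood_frames` and its parameters are hypothetical; the file naming pattern is the one used by the collection script.

```python
import json
from pathlib import Path

import pandas as pd


def load_mood_frames(raw_dir, moods):
    """Concatenate every spotify_<mood>_<n>.json file found under raw_dir."""
    frames = []
    for mood in moods:
        # sorted() keeps the load order stable across runs
        for path in sorted(Path(raw_dir).glob(f"spotify_{mood}_*.json")):
            with open(path, "r") as fileio:
                df = pd.DataFrame(json.load(fileio))
            df["mood"] = mood
            frames.append(df)
    if not frames:
        # show the absolute path so a wrong working directory is easy to spot
        raise FileNotFoundError(
            f"no spotify_<mood>_<n>.json files under {Path(raw_dir).resolve()}"
        )
    return pd.concat(frames, ignore_index=True)
```

For example, `final_df = load_mood_frames('./raw', ['happy', 'sad', 'chill', 'energetic'])` would load whichever team members' files are present.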

## Data Exploration
Our project's data consists of multiple JSON files, each representing different sets of Spotify songs categorized by moods such as chill, energetic, happy, and sad, along with a separate collection of liked songs from personal user libraries. These files contain attributes that are important in understanding song characteristics which may influence their mood classification, such as tempo, loudness, energy, danceability, and others.

In our preliminary analysis, we explored how these attributes are distributed across the playlists to hypothesize which features might be most influential in determining a song's mood. Data collection was hampered by the size of the data files and by API rate limits: requesting audio features one track at a time eventually returned HTTP 429 (too many requests) errors.

The summary statistics below suggest clear distinctions in certain attributes among the mood-based playlists. For instance, songs in the happy playlists have a much higher mean valence than songs in the sad playlists (0.70 vs. 0.32) and a slightly higher mean tempo (about 120 vs. 115 BPM). This supports our objective of finding the patterns that could recreate Spotify's mood classification logic. Moving forward, we plan to apply statistical tests to confirm these observations and refine our models accordingly, aiming to predict mood classifications with high accuracy.
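
As one possible form for the planned statistical tests, the sketch below runs Welch's t-test (a two-sample t-test that does not assume equal variances) on a single feature between two moods. It assumes a DataFrame with a `mood` column and numeric feature columns, like the `moods_df` built later in this notebook. The helper name `compare_feature` is hypothetical, and Welch's test is only one reasonable choice, not a method the project has committed to.

```python
import pandas as pd
from scipy import stats


def compare_feature(df: pd.DataFrame, feature: str, mood_a: str, mood_b: str):
    """Welch's t-test on `feature` between the rows of two moods."""
    a = df.loc[df["mood"] == mood_a, feature]
    b = df.loc[df["mood"] == mood_b, feature]
    # equal_var=False selects Welch's variant of the two-sample t-test
    return stats.ttest_ind(a, b, equal_var=False)
```

A call such as `compare_feature(moods_df, 'valence', 'happy', 'sad')` returns a result with `statistic` and `pvalue` attributes. Since several features would be tested, a multiple-comparison correction may be worth considering.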

In [3]:
import pandas as pd
import json
import numpy as np

# look at all liked songs characteristics (mean, median, mode, etc.)

dfs = []

files = [
    f'./raw/liked_songs_1.json',
    f'./raw/liked_songs_2.json',
    f'./raw/liked_songs_3.json',
    f'./raw/liked_songs_4.json',
    # f'./raw/liked_songs_5.json',
    f'./raw/liked_songs_6.json',
]

for file in files:
    with open(file, 'r') as fileio:
        df = pd.DataFrame(json.load(fileio))
        dfs.append(df)

liked_songs_df = pd.concat(dfs, ignore_index=True)

#processing
drop = ['name', 'id']
liked_songs_df = liked_songs_df.drop(columns=drop)

#liked_songs_df.head(5)
liked_songs_df.describe()

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 | 2284.0 |
| mean | 0.281464 | 0.67125 | 217717.56655 | 0.597156 | 0.211037 | 5.384413 | 0.17526 | -8.426431 | 0.548161 | 0.120464 | 117.196012 | 3.944834 | 0.546951 |
| std | 0.277863 | 0.162456 | 77039.909167 | 0.188689 | 0.338679 | 3.694089 | 0.135396 | 3.73894 | 0.497784 | 0.119723 | 28.406264 | 0.350863 | 0.234064 |
| min | 5e-06 | 0.0 | 15967.0 | 0.00858 | 0.0 | 0.0 | 0.019 | -36.856 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.051775 | 0.578 | 169469.0 | 0.478 | 0.0 | 2.0 | 0.094875 | -9.9645 | 0.0 | 0.039075 | 95.02575 | 4.0 | 0.37075 |
| 50% | 0.182 | 0.699 | 207177.5 | 0.611 | 0.000325 | 5.0 | 0.122 | -7.7115 | 1.0 | 0.06335 | 114.6055 | 4.0 | 0.555 |
| 75% | 0.45625 | 0.78725 | 254793.5 | 0.734 | 0.374 | 9.0 | 0.212 | -5.9835 | 1.0 | 0.168 | 132.33225 | 4.0 | 0.74 |
| max | 0.996 | 0.978 | 828560.0 | 0.996 | 0.973 | 11.0 | 0.983 | -1.345 | 1.0 | 0.918 | 215.449 | 5.0 | 0.981 |


In [9]:
# look at single category characteristics (mean, median, mode, etc.) for each category

moods = ['happy', 'sad', 'chill', 'energetic']
dfs = []
for mood in moods:
    files = [
        f'./raw/spotify_{mood}_1.json',
        f'./raw/spotify_{mood}_2.json',
        f'./raw/spotify_{mood}_3.json',
        f'./raw/spotify_{mood}_4.json',
        f'./raw/spotify_{mood}_5.json',
        f'./raw/spotify_{mood}_6.json',
    ]

    for file in files:
        with open(file, 'r') as fileio:
            df = pd.DataFrame(json.load(fileio))
            df['mood'] = mood
            dfs.append(df)

moods_df = pd.concat(dfs, ignore_index=True)

#processing
drop = ['name', 'id']
moods_df = moods_df.drop(columns=drop)

#moods_df.head(5)
moods_df.describe()

# create visualizations for the above (+ radar plot)

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 | 1198.0 |
| mean | 0.365471 | 0.631162 | 193201.8 | 0.565686 | 0.235826 | 5.292988 | 0.175592 | -9.046698 | 0.624374 | 0.085518 | 118.607917 | 3.918197 | 0.495026 |
| std | 0.342584 | 0.166439 | 67194.32 | 0.242741 | 0.368366 | 3.569124 | 0.146602 | 5.352631 | 0.484486 | 0.095203 | 28.347392 | 0.400484 | 0.270138 |
| min | 8e-06 | 0.0 | 54600.0 | 0.00558 | 0.0 | 0.0 | 0.0245 | -32.838 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.049 | 0.53725 | 153497.2 | 0.399 | 0.0 | 2.0 | 0.096125 | -10.8855 | 0.0 | 0.036225 | 98.023 | 4.0 | 0.25625 |
| 50% | 0.245 | 0.655 | 191253.0 | 0.586 | 0.000473 | 5.0 | 0.1165 | -7.776 | 1.0 | 0.04685 | 117.9895 | 4.0 | 0.483 |
| 75% | 0.665 | 0.75075 | 222286.8 | 0.757 | 0.53675 | 8.0 | 0.194 | -5.468 | 1.0 | 0.08965 | 135.0495 | 4.0 | 0.727 |
| max | 0.996 | 0.957 | 1020000.0 | 0.998 | 0.983 | 11.0 | 0.983 | 0.328 | 1.0 | 0.918 | 220.099 | 5.0 | 0.981 |


In [10]:
happy_df = moods_df[moods_df['mood'] == 'happy']
# happy_df.head(5)
happy_df.describe()

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 |
| mean | 0.228003 | 0.69998 | 194467.153333 | 0.67543 | 0.050095 | 5.266667 | 0.173191 | -6.669817 | 0.65 | 0.076948 | 119.553113 | 3.98 | 0.703342 |
| std | 0.221277 | 0.115443 | 45106.245635 | 0.147835 | 0.16932 | 3.488474 | 0.147277 | 2.635357 | 0.477767 | 0.086437 | 24.054849 | 0.230457 | 0.191693 |
| min | 3.8e-05 | 0.197 | 60942.0 | 0.166 | 0.0 | 0.0 | 0.0246 | -22.602 | 0.0 | 0.0263 | 68.837 | 1.0 | 0.0646 |
| 25% | 0.042775 | 0.633 | 169783.0 | 0.567 | 0.0 | 2.0 | 0.085575 | -7.98625 | 0.0 | 0.036675 | 103.65575 | 4.0 | 0.55075 |
| 50% | 0.165 | 0.712 | 197575.0 | 0.683 | 7e-06 | 5.0 | 0.113 | -6.2205 | 1.0 | 0.04595 | 117.684 | 4.0 | 0.738 |
| 75% | 0.35275 | 0.78025 | 218013.0 | 0.791 | 0.00167 | 8.25 | 0.2095 | -4.87325 | 1.0 | 0.077675 | 129.2925 | 4.0 | 0.86 |
| max | 0.972 | 0.935 | 473013.0 | 0.991 | 0.978 | 11.0 | 0.955 | -1.931 | 1.0 | 0.711 | 209.688 | 5.0 | 0.981 |


In [11]:
sad_df = moods_df[moods_df['mood'] == 'sad']
# sad_df.head(5)
sad_df.describe()

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 | 300.0 |
| mean | 0.522723 | 0.558353 | 199938.663333 | 0.432511 | 0.131883 | 5.366667 | 0.161904 | -9.853107 | 0.69 | 0.071891 | 114.860097 | 3.86 | 0.324138 |
| std | 0.338082 | 0.150576 | 55978.912548 | 0.204289 | 0.277139 | 3.531827 | 0.119905 | 3.796737 | 0.463266 | 0.092606 | 30.075703 | 0.477311 | 0.189012 |
| min | 8e-06 | 0.174 | 57370.0 | 0.0148 | 0.0 | 0.0 | 0.0511 | -27.117 | 0.0 | 0.0247 | 59.981 | 1.0 | 0.0341 |
| 25% | 0.159 | 0.457 | 161423.0 | 0.28675 | 0.0 | 2.0 | 0.101 | -12.0725 | 0.0 | 0.0315 | 88.49175 | 4.0 | 0.18 |
| 50% | 0.6095 | 0.5565 | 200738.0 | 0.415 | 0.00031 | 5.0 | 0.115 | -9.313 | 1.0 | 0.041 | 113.983 | 4.0 | 0.2955 |
| 75% | 0.83 | 0.669 | 237807.0 | 0.57125 | 0.0381 | 8.0 | 0.167 | -7.11525 | 1.0 | 0.062625 | 133.9935 | 4.0 | 0.43275 |
| max | 0.989 | 0.946 | 400560.0 | 0.941 | 0.962 | 11.0 | 0.938 | -2.81 | 1.0 | 0.777 | 220.099 | 5.0 | 0.961 |


In [7]:
chill_df = moods_df[moods_df['mood'] == 'chill']
# the output recorded below is the first five rows, not summary statistics
chill_df.head(5)
# chill_df.describe()

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | mood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 600 | 0.0978 | 0.723 | 62216 | 0.801 | 0.827 | 4 | 0.386 | -8.253 | 1 | 0.113 | 87.95 | 4 | 0.918 | chill |
| 601 | 0.369 | 0.424 | 246186 | 0.439 | 0.796 | 4 | 0.105 | -15.861 | 0 | 0.0552 | 94.021 | 4 | 0.278 | chill |
| 602 | 0.221 | 0.844 | 332821 | 0.558 | 0.0304 | 1 | 0.0693 | -10.005 | 1 | 0.1 | 116.969 | 4 | 0.818 | chill |
| 603 | 0.677 | 0.646 | 238787 | 0.564 | 0.869 | 2 | 0.104 | -12.531 | 1 | 0.039 | 96.707 | 4 | 0.36 | chill |
| 604 | 0.823 | 0.588 | 135276 | 0.302 | 0.872 | 11 | 0.107 | -25.685 | 1 | 0.0456 | 77.99 | 4 | 0.885 | chill |


In [8]:
energetic_df = moods_df[moods_df['mood'] == 'energetic']
# the output recorded below is the first five rows, not summary statistics
energetic_df.head(5)
# energetic_df.describe()

| | acousticness | danceability | duration_ms | energy | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | valence | mood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 898 | 0.544 | 0.9 | 111429 | 0.703 | 0.0817 | 0 | 0.263 | -5.615 | 1 | 0.117 | 111.976 | 4 | 0.81 | energetic |
| 899 | 0.261 | 0.835 | 254125 | 0.655 | 0.776 | 7 | 0.185 | -9.309 | 0 | 0.0838 | 114.963 | 4 | 0.193 | energetic |
| 900 | 0.0367 | 0.694 | 182040 | 0.874 | 0.0963 | 5 | 0.955 | -5.988 | 0 | 0.0413 | 130.007 | 4 | 0.877 | energetic |
| 901 | 0.0464 | 0.872 | 263036 | 0.454 | 0.825 | 11 | 0.093 | -11.462 | 1 | 0.0541 | 118.022 | 4 | 0.0877 | energetic |
| 902 | 0.25 | 0.641 | 272933 | 0.764 | 0.00173 | 5 | 0.0687 | -6.862 | 0 | 0.0349 | 124.054 | 4 | 0.786 | energetic |


## Data Visualization
For each mood, we pick a characteristic to see whether it helps distinguish between moods.

In [None]:
# create assumptions and histograms for features for each mood
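
A minimal sketch of the per-mood histograms described in this section, assuming a DataFrame with a `mood` column and numeric feature columns (the shape of `moods_df`). The function name `plot_feature_by_mood` and the `bins` default are hypothetical choices. All panels share the same bin edges and y-axis so the distributions can be compared directly.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_feature_by_mood(df: pd.DataFrame, feature: str, bins: int = 20):
    """Draw one histogram of `feature` per mood on shared bin edges."""
    moods = sorted(df["mood"].unique())
    # compute bin edges once over all moods so the panels are comparable
    edges = np.histogram_bin_edges(df[feature], bins=bins)
    fig, axes = plt.subplots(
        1, len(moods), sharey=True, figsize=(4 * len(moods), 3), squeeze=False
    )
    for ax, mood in zip(axes[0], moods):
        ax.hist(df.loc[df["mood"] == mood, feature], bins=edges)
        ax.set_title(mood)
        ax.set_xlabel(feature)
    axes[0][0].set_ylabel("tracks")
    return fig, axes[0]
```

For example, `plot_feature_by_mood(moods_df, 'valence')` would draw four panels, one per mood.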

## ML Data Analysis

### Model Training



In [7]:
import json

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder  # only needed if the encoder lines below are re-enabled

moods = ['happy', 'sad', 'chill', 'energetic']
dfs = []
for mood in moods:
    files = [
        f'/Users/conrad/dev-school/418/final-project/raw/spotify_{mood}_1.json',
        # f'/Users/conrad/dev-school/418/final-project/raw/spotify_{mood}_2.json',
        f'/Users/conrad/dev-school/418/final-project/raw/spotify_{mood}_3.json',
        f'/Users/conrad/dev-school/418/final-project/raw/spotify_{mood}_4.json',
        f'/Users/conrad/dev-school/418/final-project/raw/spotify_{mood}_5.json',
        # f'/Users/conrad/dev-school/418/final-project/raw/spotify_{mood}_6.json',
    ]

    for file in files:
        with open(file, 'r') as fileio:
            df = pd.DataFrame(json.load(fileio))
            df['mood'] = mood
            dfs.append(df)

final_df = pd.concat(dfs, ignore_index=True)

#processing
drop = ['name', 'id']
final_df = final_df.drop(columns=drop)

final_df.to_csv('training.csv')

#splitting
X = final_df.iloc[:, 0:13]
Y = final_df.iloc[:, 13]
# encoder = LabelEncoder()
# y = encoder.fit_transform(Y)
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, test_size=.2, random_state=1)

# k-fold
k = 10
forest = RandomForestClassifier(random_state=2)
scores = cross_val_score(forest, xtrain, ytrain, cv=k)
print('CV scores', scores)
print('Mean CV scores', np.mean(scores))

# single
forest.fit(xtrain, ytrain)
print('Fit Score', forest.score(xtest, ytest))


CV scores [0.65625    0.796875   0.65625    0.734375   0.609375   0.859375
 0.609375   0.65625    0.66666667 0.66666667]
Mean CV scores 0.6911458333333333
Fit Score 0.7125
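
Because the project asks which attributes matter most for mood classification, one option is to inspect the fitted forest's impurity-based `feature_importances_`. The sketch below shows the idea on synthetic stand-in data, not the project's tracks, in which the label depends only on valence. The helper name `rank_feature_importances` and the demo variables are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def rank_feature_importances(model, feature_names):
    """Return the model's impurity-based importances, largest first."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False)


# Synthetic stand-in data (NOT the project's tracks): mood depends only on valence.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "valence": rng.uniform(0, 1, 300),
    "tempo": rng.uniform(60, 200, 300),
    "energy": rng.uniform(0, 1, 300),
})
labels = np.where(demo["valence"] > 0.5, "happy", "sad")

demo_forest = RandomForestClassifier(random_state=2).fit(demo, labels)
ranking = rank_feature_importances(demo_forest, demo.columns)
print(ranking)
```

With the notebook's own objects, the equivalent call would be `rank_feature_importances(forest, xtrain.columns)`. Impurity-based importances can be biased toward high-cardinality features, so permutation importance may be a useful cross-check.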


## Analysis

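The output below shows, for each person's liked-songs library, the percentage of tracks that the trained random forest assigns to each mood. Person 1's library leans chill (about 38%), with happy second (about 27%). Person 2's library is split almost evenly between energetic (about 41%) and happy (about 39%). Person 3's library is predominantly sad (62.5%). Since the classifier scored about 0.71 on the held-out test set, these shares should be read as approximate.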

In [8]:
#analysis

import json
import pandas as pd

dfs = []

files = [
    f'/Users/conrad/dev-school/418/final-project/raw/liked_songs_1.json',
    # f'/Users/conrad/dev-school/418/final-project/raw/liked_songs_2.json',
    f'/Users/conrad/dev-school/418/final-project/raw/liked_songs_3.json',
    f'/Users/conrad/dev-school/418/final-project/raw/liked_songs_4.json',
    # f'/Users/conrad/dev-school/418/final-project/raw/liked_songs_5.json',
    # f'/Users/conrad/dev-school/418/final-project/raw/liked_songs_6.json',
]

for file in files:
    with open(file, 'r') as fileio:
        df = pd.DataFrame(json.load(fileio))
        dfs.append(df)

#processing
drop = ['name', 'id']
# drop the identifier columns from every loaded library, however many there are
dfs = [df.drop(columns=drop) for df in dfs]

# predict a mood for every track in each person's library
predictions = [forest.predict(df) for df in dfs]

person = 1
for prediction in predictions:
    print(f'Person {person}')
    print('Happy', (prediction.tolist().count('happy')/len(prediction.tolist()))*100)
    print('Sad', (prediction.tolist().count('sad')/len(prediction.tolist()))*100)
    print('Chill', (prediction.tolist().count('chill')/len(prediction.tolist()))*100)
    print('Energetic', (prediction.tolist().count('energetic')/len(prediction.tolist()))*100)
    print()
    person += 1

Person 1
Happy 26.87007874015748
Sad 13.385826771653544
Chill 37.99212598425197
Energetic 21.751968503937007

Person 2
Happy 38.88888888888889
Sad 12.962962962962962
Chill 7.4074074074074066
Energetic 40.74074074074074

Person 3
Happy 14.583333333333334
Sad 62.5
Chill 2.083333333333333
Energetic 20.833333333333336
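
The percentages above can also be computed with a small helper that counts each label once. This is a minimal sketch, assuming each prediction is an iterable of mood labels like the arrays returned by `forest.predict`. The names `MOODS` and `mood_shares` are hypothetical.

```python
from collections import Counter

MOODS = ("happy", "sad", "chill", "energetic")


def mood_shares(prediction, moods=MOODS):
    """Return {mood: percentage of tracks predicted as that mood}."""
    # str() normalizes NumPy string labels to plain Python strings
    labels = [str(label) for label in prediction]
    counts = Counter(labels)
    total = len(labels)
    return {mood: (100 * counts[mood] / total if total else 0.0) for mood in moods}
```

For example, `mood_shares(['happy', 'sad', 'happy', 'chill'])` gives 50% happy, 25% sad, 25% chill, and 0% energetic.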



## Progress reflection
#### Hardest Part of the Project:
The most challenging aspect has been handling the large data sets and managing the Spotify API rate limits, which restricted our ability to retrieve data efficiently. These technical issues required a reevaluation of our data collection strategies, including implementing caching and batching requests to better manage API usage.
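
A minimal sketch of the caching idea, assuming features are stored on disk as one JSON object keyed by track ID so that a re-run only requests tracks it has not seen. The names `load_cache`, `features_with_cache`, and the `fetch` callable are hypothetical. `fetch` stands for any function that takes a list of IDs and returns `{track_id: features}`.

```python
import json
from pathlib import Path


def load_cache(path):
    """Read {track_id: features} from disk, or start empty if no file exists."""
    path = Path(path)
    if path.exists():
        with open(path, "r") as f:
            return json.load(f)
    return {}


def features_with_cache(track_ids, fetch, cache_path):
    """Return features for `track_ids`, calling `fetch` only for uncached IDs."""
    cache = load_cache(cache_path)
    # dict.fromkeys removes duplicate IDs while preserving their order
    missing = [tid for tid in dict.fromkeys(track_ids) if tid not in cache]
    if missing:
        cache.update(fetch(missing))
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return {tid: cache[tid] for tid in track_ids if tid in cache}
```

Because the cache file persists between runs, a collection script interrupted by a rate limit can resume without repeating requests that already succeeded.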

#### Initial Insights:
Our exploratory data analysis showed that song attributes such as valence, energy, and tempo vary across moods. For instance, 'happy' songs have a higher mean energy (0.68 vs. 0.43) and a slightly higher mean tempo than 'sad' songs, which have much lower valence scores (0.32 vs. 0.70 on average).

#### Concrete Results:
A preliminary random forest classifier trained on the mood playlists reaches a mean 10-fold cross-validation accuracy of about 0.69 and an accuracy of about 0.71 on the held-out test set. Together with our summary statistics, this indicates that the attributes we are studying do correlate with the categorizations provided by Spotify.

#### Current Problems and Adjustments:
Given the limitations in data acquisition, we need to allocate more time to improving our data pipeline so that it can handle large volumes of data more effectively, possibly by using different data querying tools.

#### On Track:
We are on track in terms of understanding the data and setting up the necessary infrastructure for analysis. However, we are being delayed by API rate limits.

#### Change in Project Direction:
Based on initial findings, focusing more on the individual features used by this model could be beneficial. Expanding our feature set with additional data such as user listening habits might improve our model's accuracy and align it more closely with Spotify's algorithm.




## Roles and Coordination

Finding data sources and cleaning (By 11/7):


Data Exploration (By 11/15): Ceasar Attar


Statistical analysis (By 11/15): 


Data Visualization (By 11/23): Ceasar Attar


Machine Learning Applications (By 11/29): 

## Next Steps
### Optimize Data Collection:
Implement more efficient data handling to manage larger datasets without hitting API limits.
### Expand Feature Analysis:
Incorporate additional features from user profiles to enrich our predictive models.
### Model Development:
Extend the preliminary random forest by training and comparing additional machine learning models on the cleaned and processed data.
### Evaluate Progress:
We will assess our model's performance based on its accuracy in classifying song moods and adjust our strategies accordingly. This evaluation will help us determine if our approach aligns with the project's goals and Spotify's classification standards.