# **Have music listening habits changed over the last 6 years?**

## Project 1 - group 2

# Data preparation

### Background

In this project, we aim to understand whether music listening habits have changed in the last 6 years. 

There are two main reasons that lead us to believe music listening habits have changed:
- the global COVID pandemic (2020-2021) is likely to have led to changes in how people listen to music and what music they listen to;
- several articles mention the effects that the rise of the social network TikTok (from 2018 on) is having on the music industry (see, for instance, https://theconversation.com/love-it-or-hate-it-tiktok-is-changing-the-music-industry-171482).

We thus hypothesise that **listening habits are likely to have changed**, and ask the following questions to investigate this hypothesis:
1. Have the music genres people listen to changed?
2. Have the artists people listen to changed?
3. Is the duration of the most popular songs decreasing?
4. How have other track features changed?


### Data collection
Every year, Spotify releases a playlist with the top 100 hit songs for that year. In order to answer the questions above, we selected 6 playlists, one for each of the last 6 years. 

We chose playlists from 2017 to 2022 in order to capture both any changes preceding the rise of TikTok and the COVID pandemic, and any long-lasting changes in music listening habits post-COVID.

Once the playlists were selected, we used Spotipy (https://spotipy.readthedocs.io/en/2.22.1/), a Python library for the Spotify Web API, to request information about each track on each playlist. Note that in order to make these requests, we first needed to set up our credentials for Spotify's Web API (https://developer.spotify.com/documentation/web-api).
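As a rough sketch of the credential setup, the `config` module imported by the notebook could read the two values from environment variables. The environment variable names and the `load_credentials` helper are assumptions made for illustration and are not taken from the project; only the names `client_id` and `client_secret` come from the notebook.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of settings.

    The variable names below are hypothetical; any names would work as
    long as the notebook ends up with client_id and client_secret.
    """
    try:
        return env["SPOTIFY_CLIENT_ID"], env["SPOTIFY_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(f"Missing credential: {missing}") from None
```

In a `config.py`, a line such as `client_id, client_secret = load_credentials()` would then expose the two names the notebook imports, while keeping the secret values out of the repository.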

The information obtained from Spotipy was then combined into a single dataframe, with an added column to specify playlist year, resulting in a final dataframe with 600 rows (100 songs for each of the 6 years) by 24 columns. 
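The combination step can be sketched with made-up data, assuming one small dataframe per year. The track values and the `per_year` name are placeholders rather than real playlist data, and the notebook itself builds the year column in a different way.

```python
import pandas as pd

# Hypothetical per-year data: two tracks per year instead of 100
per_year = {
    "2017": pd.DataFrame({"id": ["a", "b"], "track name": ["Song A", "Song B"]}),
    "2018": pd.DataFrame({"id": ["b", "c"], "track name": ["Song B", "Song C"]}),
}

# Tag each yearly dataframe with its playlist year, then stack them
combined = pd.concat(
    [df.assign(Year=year) for year, df in per_year.items()],
    ignore_index=True,
)

print(combined.shape)  # (4, 3): 2 tracks x 2 years, 2 original columns + Year
```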

The dataframe was carefully inspected for any missing data or duplicates, before being saved as a CSV file. Duplicated songs that were on playlists for different years were kept, as the same song might be in the top 100 for more than one year and that is important information.
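The inspection can be illustrated with a minimal sketch on made-up data. The values and the `tempo` column are placeholders; the point is that a song is treated as a duplicate only when the same `id` repeats within the same `Year`.

```python
import pandas as pd

# Made-up data: track "b" is on two playlists, track "c" is repeated within 2018
df = pd.DataFrame({
    "Year": ["2017", "2017", "2018", "2018", "2018"],
    "id": ["a", "b", "b", "c", "c"],
    "tempo": [120.0, 98.5, 98.5, 140.0, 140.0],
})

# Count missing values per column
missing_per_column = df.isna().sum()

# Rows that repeat an earlier (id, Year) pair are true duplicates
same_year_dupes = df.duplicated(subset=["id", "Year"]).sum()

# Drop only same-year repeats; the cross-year repeat of "b" is kept
cleaned = df.drop_duplicates(subset=["id", "Year"])
```

Here `cleaned` still contains track "b" twice, once for each year it charted, while the repeated 2018 row for track "c" is removed.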

More information on the project and on the final results can be found in the project GitHub repository (https://github.com/catisf/Project-1-Group-2) and in the data analyses Jupyter notebook (https://github.com/catisf/Project-1-Group-2/blob/main/jupyter_notebooks/spotipy_data_analyses.ipynb).



In [1]:
# import dependencies
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from pprint import pprint
import pandas as pd
import numpy as np

# Authorisation
from config import client_id
from config import client_secret

# Initialize the Spotify client with client credentials flow
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [2]:
# Top hits playlists for the last 6 years
Years = {"2017": "https://open.spotify.com/playlist/37i9dQZF1DWTE7dVUebpUW",
        "2018": "https://open.spotify.com/playlist/37i9dQZF1DXe2bobNYDtW8",
        "2019": "https://open.spotify.com/playlist/37i9dQZF1DWVRSukIED0e9",
        "2020": "https://open.spotify.com/playlist/2fmTTbBkXi8pewbUvG3CeZ",
        "2021": "https://open.spotify.com/playlist/5GhQiRkGuqzpWZSE7OU4Se",
        "2022": "https://open.spotify.com/playlist/56r5qRUv3jSxADdmBkhcz7"}

# set empty lists to add data to
track_features = []
track_id = []
track_name = []
track_popularity = []
artist_name = []
artist_id = []
artist_genre = []

# loop through years to request data we need for each year
for year in Years:
    playlist_URI = Years[year]

    # request this year's playlist once and reuse the response for every field below
    items = sp.playlist_tracks(playlist_URI)["items"]

    # get track ids for this year
    track_id_this_year = [x["track"]["id"] for x in items]

    # get track features
    track_features.extend(sp.audio_features(track_id_this_year))

    # get the rest of the track info
    track_id.extend(track_id_this_year)
    track_name.extend([x["track"]["name"] for x in items])
    track_popularity.extend([x["track"]["popularity"] for x in items])

    # get artist info (first listed artist only)
    artist_name.extend([x["track"]["artists"][0]["name"] for x in items])
    artist_id.extend([x["track"]["artists"][0]["id"] for x in items])

    # get artist uri so we can get genre
    artist_uri = [x["track"]["artists"][0]["uri"] for x in items]
    artist_genre.extend([sp.artist(uri)["genres"] for uri in artist_uri])
    

In [3]:
# join all the information in a dataframe

# first the track info
years = ['2017', '2018', '2019', '2020', '2021', '2022']
year_index = np.repeat(years,100)

track_info = pd.DataFrame({"Year": year_index,
                           "id" : track_id,
                           "artist id" : artist_id,
                           "track name" : track_name,
                           "artist name" : artist_name,
                           "artist genre": artist_genre,
                           "track popularity" : track_popularity})

# then the track features
track_features_df = pd.DataFrame(track_features)

# and now merge the two
complete_df = pd.merge(track_info, track_features_df, on = "id", how = "inner")

# merging on "id" alone pairs every copy of a repeated song with every copy of its features,
# so the merged df has 44 duplicates (same ID, same year). Dropping them here leaves a df
# with len = 600 (100 songs for each year)
complete_df = complete_df.drop_duplicates(["id", "Year"])

# save output
complete_df.to_csv('../output_data/spotipy_data.csv')