# COGS 108 - Data Checkpoint

# Names

- Cairo Simpson
- Michelle Liu
- Michael Donaldson
- Matthew Leffler
- Charmie Donasco

<a id='research_question'></a>
# Research Question

How have the musical characteristics (e.g., danceability, tempo, key, chord progression, and instrumentation) of Kanye West's music evolved from his first solo album, The College Dropout, to his latest solo album, Donda (Deluxe)?

# Dataset(s)

- Dataset Name: spotify_features.csv
- Link to the dataset: https://github.com/COGS108/Group034Sp22/blob/master/Kanye_Audio_Features
- Number of observations: 199

Contains the audio features (e.g., valence, tempo, instrumentalness) along with the album and release date for every song on Kanye's solo studio albums, pulled from Spotify.

Below is the code used to create the dataset using the Spotify API.

In [None]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd

In [None]:
# Setting up dataframe
column_names = ["Track Name", "Album", "Release Date", "Danceability", "Energy", "Key", "Loudness", "Mode", "Speechiness", "Acousticness", "Instrumentalness", "Liveness", "Valence", "Tempo"]
df = pd.DataFrame(columns = column_names)

# Token authorization: read the credentials from environment variables
# instead of hardcoding them in the notebook
import os
cid = os.environ["SPOTIPY_CLIENT_ID"]
secret = os.environ["SPOTIPY_CLIENT_SECRET"]
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

######### Discography Part 1 ############

playlist_link = "https://open.spotify.com/playlist/1dztCX69ug6G4o9jK2zJ2S?si=xLf4TI-MQAyMvWDX2zI9Nw"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]

for track in sp.playlist_tracks(playlist_URI)["items"]:
    track_info = {}
    features = sp.audio_features(track["track"]["uri"])
    track_info["Track Name"] = track["track"]["name"]
    track_info["Album"] = track["track"]["album"]["name"]
    track_info["Release Date"] = track["track"]["album"]["release_date"]
    track_info["Danceability"] = features[0]["danceability"]
    track_info["Energy"] = features[0]["energy"]
    track_info["Key"] = features[0]["key"]
    track_info["Loudness"] = features[0]["loudness"]
    track_info["Mode"] = features[0]["mode"]
    track_info["Speechiness"] = features[0]["speechiness"]
    track_info["Acousticness"] = features[0]["acousticness"]
    track_info["Instrumentalness"] = features[0]["instrumentalness"]
    track_info["Liveness"] = features[0]["liveness"]
    track_info["Valence"] = features[0]["valence"]
    track_info["Tempo"] = features[0]["tempo"]
    df.loc[df.shape[0]] = track_info

######### Discography Part 2 ###########

playlist_link = "https://open.spotify.com/playlist/4YadIK03lniReeC1m5aFcg?si=9UqvaaA8R6qk9T5a1GZEfg"
playlist_URI = playlist_link.split("/")[-1].split("?")[0]


for track in sp.playlist_tracks(playlist_URI)["items"]:
    track_info = {}
    features = sp.audio_features(track["track"]["uri"])
    track_info["Track Name"] = track["track"]["name"]
    track_info["Album"] = track["track"]["album"]["name"]
    track_info["Release Date"] = track["track"]["album"]["release_date"]
    track_info["Danceability"] = features[0]["danceability"]
    track_info["Energy"] = features[0]["energy"]
    track_info["Key"] = features[0]["key"]
    track_info["Loudness"] = features[0]["loudness"]
    track_info["Mode"] = features[0]["mode"]
    track_info["Speechiness"] = features[0]["speechiness"]
    track_info["Acousticness"] = features[0]["acousticness"]
    track_info["Instrumentalness"] = features[0]["instrumentalness"]
    track_info["Liveness"] = features[0]["liveness"]
    track_info["Valence"] = features[0]["valence"]
    track_info["Tempo"] = features[0]["tempo"]
    df.loc[df.shape[0]] = track_info

# Include the .csv extension so the file matches the name read back in later
df.to_csv("spotify_features.csv", index = False)
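
One caveat about the loops above: a single `sp.playlist_tracks` call returns one page of results rather than the whole playlist, so a playlist longer than one page would be silently truncated. A minimal sketch of collecting every page, assuming the client exposes `playlist_tracks` and `next` as spotipy's `Spotify` class does; the helper name `fetch_all_playlist_items` is hypothetical and not part of the original notebook:

```python
def fetch_all_playlist_items(sp, playlist_id):
    """Collect the items from every page of a playlist (hypothetical helper)."""
    page = sp.playlist_tracks(playlist_id)
    items = list(page["items"])
    # Paged responses carry a "next" URL until the last page is reached
    while page.get("next"):
        page = sp.next(page)
        items.extend(page["items"])
    return items
```

With this, each loop could iterate over `fetch_all_playlist_items(sp, playlist_URI)` instead of `sp.playlist_tracks(playlist_URI)["items"]`.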

- Dataset Name: lyrics.csv
- Link to the dataset: https://github.com/COGS108/Group034Sp22/blob/master/lyrics.csv
- Number of observations: 205

Below is the code used to create the dataset using the Genius API.

In [None]:
import pandas as pd
import api_key
import lyricsgenius
import json

Here, we list every album of interest and query Genius for each one, saving the result for each album as a JSON file.

In [None]:
genius = lyricsgenius.Genius(api_key.GENIUS_CLIENT_TOKEN)
genius.verbose = False
genius.remove_section_headers = True
albums = ["The College Dropout", "Late Registration", "Graduation", "808s & Heartbreak", 
          "My Beautiful Dark Twisted Fantasy", "Watch the Throne", 
          "Kanye West Presents: Good Music - Cruel Summer G.O.O.D. Music", "Yeezus", 
          "The Life of Pablo", "ye", "KIDS SEE GHOSTS", "JESUS IS KING", "Donda (Deluxe)"]
names = ["college_dropout", "late_registration", "graduation", "808s_and_heartbreak",
        "dark_fantasy", "watch_the_throne", "good_music", "yeezus", "life_of_pablo", 
        "ye", "kids_see_ghosts", "jesus_is_king", "donda"]
artists = ["Kanye West", "G.O.O.D Music", "KIDS SEE GHOSTS"]

for i, album in enumerate(albums):
    while True: # while loop because library often times out
        try:
            if album == "Kanye West Presents: Good Music - Cruel Summer G.O.O.D. Music":
                search = genius.search_album(album, "G.O.O.D Music")
            elif album == "KIDS SEE GHOSTS":
                search = genius.search_album(album, "KIDS SEE GHOSTS")
            else:
                search = genius.search_album(album, "Kanye West")
            search.save_lyrics(f'{names[i]}.json')
            break
        except Exception:  # retry on timeouts and request errors, but let Ctrl-C through
            pass
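
The `while True` loop above retries forever, which would hang the notebook if an album can never be fetched. A minimal sketch of a bounded alternative, using a hypothetical helper `retry` (not part of lyricsgenius) that gives up after a fixed number of attempts; the default values for `attempts` and `delay` are arbitrary assumptions:

```python
import time

def retry(func, attempts=5, delay=1.0):
    """Call func() until it succeeds, trying at most `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # brief pause before the next try
```

For example, `search = retry(lambda: genius.search_album(album, "Kanye West"))` could stand in for one branch of the loop body.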

For each JSON file, we write the fields of interest (track name, album, and lyrics) to a CSV file.

In [None]:
rows = []

for name in names:
    # Reads from a json/ directory, so the files saved above must be placed there first
    with open(f'json/{name}.json') as json_file:
        data = json.load(json_file)
    for track in data['tracks']:
        rows.append({'Track Name': track['song']['full_title'],
                     'Album': data['name'],
                     'Lyrics': track['song']['lyrics']})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=['Track Name', 'Album', 'Lyrics'])
df.to_csv('lyrics.csv', index=False)

# Setup

In [None]:
import pandas as pd
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# Data Cleaning

Here, we can take a glimpse at the two datasets that we are working with.

In [None]:
df_spotify = pd.read_csv('spotify_features.csv')
df_spotify.head()

In [None]:
df_lyrics = pd.read_csv('lyrics.csv')
df_lyrics.head()

The Spotify features dataset was pulled cleanly and needs few changes, but the lyrics dataset has two issues: the track names don't match those in the Spotify dataset, and each lyrics entry begins with the song's title.

The first issue has a relatively simple fix: drop everything from the word 'by' onward, since none of his song titles appear to contain that word. The second can be fixed just as easily by keeping only the text after the first 'Lyrics\n', which marks the end of the title line.

In [None]:
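def fix_song_name(song_name):
    # Keep only the title: drop ' by' and everything after it
    song_name = str(song_name)
    return song_name.split(' by')[0]

def fix_lyrics(lyrics):
    # Drop the leading title line, which ends with 'Lyrics\n'
    lyrics = str(lyrics)
    split = lyrics.split('Lyrics\n', 1)
    if len(split) > 1:
        return split[1]
    else:
        return lyrics  # no title line found; leave the text unchanged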

df_lyrics['Track Name'] = df_lyrics['Track Name'].map(fix_song_name)
df_lyrics['Lyrics'] = df_lyrics['Lyrics'].map(fix_lyrics)
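
Once the names are cleaned, it may be worth checking how many still fail to line up with the Spotify dataset. A minimal sketch, assuming that titles should match after lowercasing and collapsing whitespace; `normalize_title` and `unmatched_titles` are hypothetical helpers, and the input lists are illustrative rather than read from the datasets:

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so near-identical titles compare equal."""
    return " ".join(str(title).lower().split())

def unmatched_titles(left, right):
    """Return the titles in `left` with no normalized match in `right`."""
    right_keys = {normalize_title(t) for t in right}
    return [t for t in left if normalize_title(t) not in right_keys]

# Illustrative input lists, not read from the datasets
lyrics_titles = ["Jesus Walks", "Stronger ", "Some Bonus Track"]
spotify_titles = ["jesus walks", "Stronger"]
print(unmatched_titles(lyrics_titles, spotify_titles))
```

On the real data this could be called as `unmatched_titles(df_lyrics['Track Name'], df_spotify['Track Name'])`; a short result would suggest the two datasets can be joined on track name.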