# Climate Jukebox Analyses

In this forked repo, I plan to run some analyses on the Climate Jukebox dataset, a collection of songs related to climate change. Let's see if we can find anything fun!

**UNDER CONSTRUCTION**

In [3]:
# Imports
import numpy as np
import pandas as pd
import plotly.express as px
from umap import UMAP

In [4]:
# Set some display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [5]:
# Read in the csv and view its structure.
df = pd.read_csv("../data/climatejukebox_latest.csv")
df.head()

Unnamed: 0,Song Title,Artist,Album,Year,Popularity,Genres,Themes
0,1.5,Sirintip,carbon,2022,14.0,"Pop, Electronic",Warming
1,21st Century,The Living End,White Noise,2008,4.0,"Punk, Rock",Warming
2,4 Degrees,ANOHNI,HOPELESSNESS,2016,14.0,Pop,Warming
3,99 In October,The RTs,See You Again,2020,2.0,"Indie, Rock",Warming
4,A Beautiful Lie,30 Seconds to Mars,A Beautiful Lie,2005,62.0,"Rock, Alternative Rock",Melting Ice


In [6]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Song Title  220 non-null    str    
 1   Artist      220 non-null    str    
 2   Album       220 non-null    str    
 3   Year        220 non-null    int64  
 4   Popularity  205 non-null    float64
 5   Genres      215 non-null    str    
 6   Themes      218 non-null    str    
dtypes: float64(1), int64(1), str(5)
memory usage: 12.2 KB


Rows of the CSV correspond to songs, with the following columns:
- Song Title: The title of the song.
- Artist: The artist who wrote the song.
- Album: The album the song appears on.
- Year: The year the song was released.
- Popularity: The Spotify popularity score of the song.
- Genres: The genres of the song, comma-delimited.
- Themes: Which climate themes are present in the song, comma-delimited.

## Data Cleaning

Handle missing values (NaNs) in Popularity, Genres, and Themes.

In [7]:
# Count NAs
df.isna().sum()

Song Title     0
Artist         0
Album          0
Year           0
Popularity    15
Genres         5
Themes         2
dtype: int64

In [8]:
# # Replace popularity NAs with 0s
# df['Popularity'] = df['Popularity'].fillna(0.0)

# Replace genre/theme NAs with the placeholder label 'Unlisted'
df[['Genres', 'Themes']] = df[['Genres', 'Themes']].fillna('Unlisted')

# Sanity check
df.isna().sum()

Song Title     0
Artist         0
Album          0
Year           0
Popularity    15
Genres         0
Themes         0
dtype: int64
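
The popularity fill is left commented out above, so the 15 missing scores stay as NaN. One possible reason to keep it that way, sketched here on made-up numbers rather than the real data: pandas skips NaN in aggregations by default, whereas filling with 0 would pull averages down.

```python
import numpy as np
import pandas as pd

# Toy popularity scores (hypothetical values, not taken from the dataset)
toy = pd.Series([10.0, 20.0, np.nan, 30.0], name="Popularity")

# NaN is skipped by default: (10 + 20 + 30) / 3
mean_skipping_nan = toy.mean()

# NaN treated as a real score of 0: (10 + 20 + 0 + 30) / 4
mean_after_fill = toy.fillna(0.0).mean()

print(mean_skipping_nan, mean_after_fill)
```

The same default applies to `px.histogram`, which should simply leave out the missing scores rather than stacking them at zero.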

In [9]:
# Split comma-delimited values for cells with multiple values.
df['Genres'] = df['Genres'].str.split(",").apply(lambda x: [v.strip() for v in x])
df['Themes'] = df['Themes'].str.split(",").apply(lambda x: [v.strip() for v in x])

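I'm choosing to break the usual one-value-per-cell ("tidy data") convention here and keep multiple values per row in Genres and Themes, stored as Python lists. This keeps one row per song; the lists can be expanded with `explode` whenever per-genre or per-theme counts are needed.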
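
As a small illustration on a toy frame (the rows and the names `toy`, `long_genres`, and `genre_totals` are made up for this sketch), list-valued cells keep one row per song but can still be flattened on demand:

```python
import pandas as pd

# Two hypothetical songs with comma-delimited genres, as in the raw CSV
toy = pd.DataFrame({
    "Song Title": ["Song A", "Song B"],
    "Genres": ["Pop, Electronic", "Rock"],
})

# Same split-and-strip step as in the cell above
toy["Genres"] = toy["Genres"].str.split(",").apply(lambda x: [v.strip() for v in x])

# explode() gives one row per (song, genre) pair without changing `toy`
long_genres = toy.explode("Genres")
genre_totals = long_genres["Genres"].value_counts()

print(len(toy), len(long_genres))
```

The wide frame still has one row per song, while the exploded copy has one row per genre label.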

In [10]:
# Check the head again
df.head()

Unnamed: 0,Song Title,Artist,Album,Year,Popularity,Genres,Themes
0,1.5,Sirintip,carbon,2022,14.0,"[Pop, Electronic]",[Warming]
1,21st Century,The Living End,White Noise,2008,4.0,"[Punk, Rock]",[Warming]
2,4 Degrees,ANOHNI,HOPELESSNESS,2016,14.0,[Pop],[Warming]
3,99 In October,The RTs,See You Again,2020,2.0,"[Indie, Rock]",[Warming]
4,A Beautiful Lie,30 Seconds to Mars,A Beautiful Lie,2005,62.0,"[Rock, Alternative Rock]",[Melting Ice]


## Descriptive Statistics

In [11]:
# Get unique genres and counts.
df['Genres'].explode().value_counts().sort_values(ascending=False)

Genres
Pop                  60
Rock                 44
Hip Hop              40
Heavy Metal          37
Electronic           20
Alternative Rock     19
Singer-Songwriter    18
Punk                 16
Indie                15
R&B                   8
Soundtrack            6
Folk                  5
Unlisted              5
Country               5
Musicals              4
Reggae                4
Christmas             3
Jazz                  3
K-pop                 2
World                 1
Name: count, dtype: int64

In [26]:
# Get unique themes and counts.
df['Themes'].explode().value_counts().sort_values(ascending=False)

Themes
Warming                120
Climate Change          45
Melting Ice             33
Climate Activism        23
Sea Level Rise          22
Carbon Emissions        22
Wildfires               21
Climate Denial          19
Biodiversity Loss       11
Droughts                11
Floods                  10
Deforestation            6
Renewable Energy         4
Unlisted                 2
Climate Migration        2
Ocean Acidification      2
Name: count, dtype: int64

In [27]:
# Get unique artist/album combinations and counts
df[['Artist', 'Album']].value_counts().sort_values(ascending=False)


Artist                                                     Album                                                                                                                       
Sirintip                                                   carbon                                                                                                                          5
King Gizzard & the Lizard Wizard                           PetroDragonic Apocalypse; or Dawn of Eternal Night: An Annihilation of Planet Earth and the Beginning of Merciless Damnation    3
Thom Yorke                                                 The Eraser                                                                                                                      2
Anvil                                                      Legal at Last                                                                                                                   2
Flobots                                                    S

In [28]:
# Histogram of release years
px.histogram(df['Year'])

In [29]:
# Histogram of popularity scores
px.histogram(df['Popularity'])


## Plots

In [34]:
# Bin years into 5-year bins labelled by their first year (e.g. 2008 -> 2005)
bin_size = 5
df['Bin'] = (df['Year'] // bin_size) * bin_size

# Explode both columns with multiple labels
df_exploded = df.explode('Genres')
df_exploded = df_exploded.explode('Themes')

df_exploded.info()

<class 'pandas.DataFrame'>
Index: 517 entries, 0 to 219
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Song Title  517 non-null    str    
 1   Artist      517 non-null    str    
 2   Album       517 non-null    str    
 3   Year        517 non-null    int64  
 4   Popularity  492 non-null    float64
 5   Genres      517 non-null    str    
 6   Themes      517 non-null    str    
 7   Bin         517 non-null    int64  
dtypes: float64(1), int64(2), str(5)
memory usage: 36.4 KB


In [38]:
# Count rows per (bin, genre) and per (bin, theme); one row per bin, one column per label
genre_counts = df_exploded.groupby(['Bin', 'Genres']).size().unstack(fill_value=0)
theme_counts = df_exploded.groupby(['Bin', 'Themes']).size().unstack(fill_value=0)
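
One thing to keep in mind when reading these counts: `df_exploded` is exploded on both Genres and Themes, so a song with two genres and three themes contributes six rows, and each of its genres is counted three times. Whether that weighting is wanted depends on the question. A hedged sketch on toy data (made-up rows; `genre_only` and the other names are introduced here for illustration) comparing it with exploding a single column, which counts each song once per genre:

```python
import pandas as pd

# Two hypothetical songs in the same 5-year bin
toy = pd.DataFrame({
    "Bin": [2005, 2005],
    "Genres": [["Pop", "Rock"], ["Pop"]],
    "Themes": [["Warming", "Floods", "Droughts"], ["Warming"]],
})

# Exploding both columns yields every (genre, theme) pair per song
both = toy.explode("Genres").explode("Themes")
pairs_per_genre = both.groupby(["Bin", "Genres"]).size().unstack(fill_value=0)

# Exploding only Genres yields one row per (song, genre)
genre_only = toy.explode("Genres")
songs_per_genre = genre_only.groupby(["Bin", "Genres"]).size().unstack(fill_value=0)

print(pairs_per_genre)
print(songs_per_genre)
```

In the toy data, Pop is counted 4 times in the pair-based table but only 2 times in the song-based one.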

In [40]:
px.line(data_frame=genre_counts)

In [None]:
px.line(data_frame=theme_counts)

In [31]:
# Scale each bin's row of genre counts to unit L2 (Euclidean) norm
row_norms = np.linalg.norm(genre_counts.values, axis=1)
counts_normed = genre_counts.div(row_norms, axis=0)
counts_normed

Bin,Alternative Rock,Christmas,Country,Electronic,Folk,Heavy Metal,Hip Hop,Indie,Jazz,K-pop,Musicals,Pop,Punk,R&B,Reggae,Rock,Singer-Songwriter,Soundtrack,Unlisted,World
1985,0.0,0.15861,0.0,0.0,0.0,0.713746,0.0,0.0,0.0,0.0,0.0,0.0,0.555136,0.0,0.0,0.396526,0.0,0.0,0.0,0.0
1990,0.166667,0.0,0.0,0.0,0.0,0.722222,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.666667,0.055556,0.0,0.0,0.0
1995,0.204124,0.0,0.0,0.204124,0.0,0.204124,0.612372,0.0,0.408248,0.0,0.0,0.408248,0.0,0.0,0.0,0.408248,0.0,0.0,0.0,0.0
2000,0.0,0.0,0.0,0.229416,0.0,0.0,0.0,0.229416,0.0,0.0,0.0,0.688247,0.458831,0.0,0.0,0.458831,0.0,0.0,0.0,0.0
2005,0.197952,0.118771,0.118771,0.158362,0.03959,0.079181,0.475085,0.158362,0.0,0.0,0.0,0.554265,0.356313,0.079181,0.03959,0.356313,0.277133,0.03959,0.03959,0.0
2010,0.265684,0.0,0.132842,0.088561,0.044281,0.177123,0.752773,0.221404,0.044281,0.0,0.044281,0.265684,0.044281,0.0,0.044281,0.177123,0.309965,0.0,0.221404,0.0
2015,0.121317,0.02022,0.02022,0.101098,0.060659,0.46505,0.48527,0.080878,0.0,0.0,0.080878,0.606587,0.040439,0.121317,0.080878,0.283074,0.141537,0.121317,0.0,0.0
2020,0.135477,0.0,0.033869,0.18628,0.033869,0.355626,0.050804,0.270953,0.0,0.067738,0.0,0.626579,0.050804,0.101607,0.0,0.508037,0.270953,0.0,0.016935,0.016935
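
The scaling above divides each bin's row by its Euclidean (L2) norm, so the squared values in a row sum to 1; the values are not shares of the bin. A sketch on toy counts (made-up numbers; `shares` is a name introduced here) comparing this with dividing by the row sum, which gives proportions that add up to 1 and may be easier to read on a line plot:

```python
import numpy as np
import pandas as pd

# Hypothetical genre counts for two 5-year bins
toy_counts = pd.DataFrame(
    {"Pop": [3, 0], "Rock": [4, 5]},
    index=pd.Index([2000, 2005], name="Bin"),
)

# L2 normalization, as in the cell above: each row becomes a unit vector
l2_normed = toy_counts.div(np.linalg.norm(toy_counts.values, axis=1), axis=0)

# Alternative: divide by the row total so each row holds genre shares
shares = toy_counts.div(toy_counts.sum(axis=1), axis=0)

print(l2_normed)
print(shares)
```

For the 2000 bin, the L2 version gives 0.6 and 0.8, while the share version gives 3/7 and 4/7.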


In [32]:
px.line(data_frame=counts_normed)