# Data Wrangling for Capstone Two: Music & Happiness

## Introduction

### Import relevant libraries

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import datetime as dt

There are two files we need to wrangle: the song data from Spotify and the World Happiness data.

We will start by wrangling the Spotify data.

## Data Wrangling: Spotify data

In [2]:
# Load song data
songs_data = pd.read_csv('./universal_top_spotify_songs.csv')

In [3]:
# Check data summary
songs_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1562675 entries, 0 to 1562674
Data columns (total 25 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   spotify_id          1562675 non-null  object 
 1   name                1562646 non-null  object 
 2   artists             1562647 non-null  object 
 3   daily_rank          1562675 non-null  int64  
 4   daily_movement      1562675 non-null  int64  
 5   weekly_movement     1562675 non-null  int64  
 6   country             1541368 non-null  object 
 7   snapshot_date       1562675 non-null  object 
 8   popularity          1562675 non-null  int64  
 9   is_explicit         1562675 non-null  bool   
 10  duration_ms         1562675 non-null  int64  
 11  album_name          1561855 non-null  object 
 12  album_release_date  1562018 non-null  object 
 13  danceability        1562675 non-null  float64
 14  energy              1562675 non-null  float64
 15  key            

### Type conversion

`time_signature` is a categorical variable, so let's change its type to `category`.

In [4]:
# Use the .astype() method to change the time_signature type to category.
songs_data["time_signature"] = songs_data["time_signature"].astype("category")

# Check info again
songs_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1562675 entries, 0 to 1562674
Data columns (total 25 columns):
 #   Column              Non-Null Count    Dtype   
---  ------              --------------    -----   
 0   spotify_id          1562675 non-null  object  
 1   name                1562646 non-null  object  
 2   artists             1562647 non-null  object  
 3   daily_rank          1562675 non-null  int64   
 4   daily_movement      1562675 non-null  int64   
 5   weekly_movement     1562675 non-null  int64   
 6   country             1541368 non-null  object  
 7   snapshot_date       1562675 non-null  object  
 8   popularity          1562675 non-null  int64   
 9   is_explicit         1562675 non-null  bool    
 10  duration_ms         1562675 non-null  int64   
 11  album_name          1561855 non-null  object  
 12  album_release_date  1562018 non-null  object  
 13  danceability        1562675 non-null  float64 
 14  energy              1562675 non-null  float64 
 15
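If more than one column needed this treatment (for example, if `key` were also treated as categorical), `astype` accepts a dictionary of column names to dtypes. The sketch below uses a small made-up frame named `toy`, not the real data, to show the idea:

```python
import pandas as pd

# Toy stand-in for two integer-coded columns (values are illustrative only)
toy = pd.DataFrame({"time_signature": [4, 3, 4, 5], "key": [6, 9, 5, 6]})

# Convert several columns in one call by passing a {column: dtype} dictionary
toy = toy.astype({"time_signature": "category", "key": "category"})

# Each column now stores a small set of categories instead of raw integers
print(toy.dtypes)
print(list(toy["key"].cat.categories))
```

Whether `key` should be categorical is a modelling choice; the notebook itself only converts `time_signature`.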

For the sake of filtering, `snapshot_date` will need to be converted to a datetime object.

In [5]:
songs_data['snapshot_date'] = pd.to_datetime(songs_data['snapshot_date'])
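As a hedged aside, `pd.to_datetime` can also be given an explicit format and told to coerce unparseable values to `NaT`, which makes bad rows easy to count. The sketch below runs on a made-up series (`raw_dates` is illustrative, and the `%Y-%m-%d` format is an assumption about how the dates are written):

```python
import pandas as pd

# Toy stand-in for the snapshot_date column (values are illustrative only)
raw_dates = pd.Series(["2023-10-18", "2023-12-31", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Count the rows that failed to parse so they can be inspected
n_unparsed = int(parsed.isna().sum())
print(n_unparsed)
```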

### Filtering

In [6]:
songs_data['snapshot_date'].min()

Timestamp('2023-10-18 00:00:00')

In [7]:
songs_data['snapshot_date'].max()

Timestamp('2025-01-02 00:00:00')

Our World Happiness Report data was released on March 20, 2024, so we need to filter our Spotify data to avoid data leakage. We will therefore use only Spotify data from October 2023 through December 2023. This way, we will not be using future Spotify data to explain past happiness scores.

In [8]:
# Filter to include only songs with snapshot_date from 2023
songs_data_filtered = songs_data[songs_data['snapshot_date'].dt.year == 2023]
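An equivalent way to express this filter is to compare against an explicit cutoff timestamp, which can be easier to adjust than a year check. This is a minimal sketch on a made-up frame; `toy` and `cutoff` are illustrative names, and the cutoff value is an assumption that mirrors the "2023 only" rule above:

```python
import pandas as pd

# Toy frame standing in for songs_data (dates are illustrative only)
toy = pd.DataFrame({
    "spotify_id": ["a", "b", "c", "d"],
    "snapshot_date": pd.to_datetime(
        ["2023-10-18", "2023-12-31", "2024-01-01", "2025-01-02"]
    ),
})

# Hypothetical cutoff: keep snapshots strictly before 1 January 2024
cutoff = pd.Timestamp("2024-01-01")
toy_filtered = toy[toy["snapshot_date"] < cutoff]

# Sanity check: nothing at or after the cutoff survives
assert toy_filtered["snapshot_date"].max() < cutoff
print(toy_filtered["spotify_id"].tolist())  # ['a', 'b']
```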

### Checking for missing data

In [9]:
# Check for missing data
missing = songs_data_filtered.isnull().sum()
missing

spotify_id               0
name                    22
artists                 22
daily_rank               0
daily_movement           0
weekly_movement          0
country               3706
snapshot_date            0
popularity               0
is_explicit              0
duration_ms              0
album_name             175
album_release_date     175
danceability             0
energy                   0
key                      0
loudness                 0
mode                     0
speechiness              0
acousticness             0
instrumentalness         0
liveness                 0
valence                  0
tempo                    0
time_signature           0
dtype: int64

In [10]:
# Check the percentage of entries where country is non-null (0) vs. null (1)
missing_countries = songs_data_filtered[['country']].isnull().sum(axis=1)
missing_countries.value_counts()/len(missing_countries) * 100

0    98.643857
1     1.356143
Name: count, dtype: float64

In [11]:
# Drop entries in songs_data_filtered where country is null
songs_data_filtered = songs_data_filtered.dropna(subset=['country'])

In [12]:
# Check percentage counts again
missing_countries = songs_data_filtered[['country']].isnull().sum(axis=1)
missing_countries.value_counts()/len(missing_countries) * 100

0    100.0
Name: count, dtype: float64
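The same percentages can be read off more directly with `isna().mean()`, which gives the fraction of missing values per column. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the real data):

```python
import pandas as pd

# Toy frame with one missing country and one missing name
toy = pd.DataFrame({
    "country": ["ZA", None, "US", "AE"],
    "name": ["x", "y", None, "z"],
})

# isna().mean() is the fraction of missing values per column; scale to percent
missing_pct = toy.isna().mean() * 100

# Drop only the rows where country is missing; other gaps are left alone
toy_clean = toy.dropna(subset=["country"])

print(missing_pct)
print(len(toy_clean))
```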

### Adding country names using a dictionary

Let's check which countries are in the data set:

In [13]:
# Check which countries are in the data set
countries_array = songs_data_filtered['country'].unique()
countries_array

array(['ZA', 'VN', 'VE', 'UY', 'US', 'UA', 'TW', 'TR', 'TH', 'SV', 'SK',
       'SG', 'SE', 'SA', 'RO', 'PY', 'PT', 'PL', 'PK', 'PH', 'PE', 'PA',
       'NZ', 'NO', 'NL', 'NI', 'NG', 'MY', 'MX', 'MA', 'LV', 'LU', 'LT',
       'KZ', 'KR', 'JP', 'IT', 'IS', 'IN', 'IL', 'IE', 'ID', 'HU', 'HN',
       'HK', 'GT', 'GR', 'GB', 'FR', 'FI', 'ES', 'EG', 'EE', 'EC', 'DO',
       'DK', 'DE', 'CZ', 'CR', 'CO', 'CL', 'CH', 'CA', 'BY', 'BR', 'BO',
       'BG', 'BE', 'AU', 'AT', 'AR', 'AE'], dtype=object)

Since the countries are listed only as country codes, we will add the full country names for clarity.

To do this, we will rename the `country` column to `country_code`, copy it into a new `country` column, and use `country_dict` to map the country codes to their respective country names.

In [14]:
# Create a dictionary with the full names of countries.
country_dict = {'AF': 'Afghanistan', 'AX': 'Åland Islands', 'AL': 'Albania', 'DZ': 'Algeria', 'AS': 'American Samoa', 'AD': 'Andorra', 'AO': 'Angola', 'AI': 'Anguilla', 'AQ': 'Antarctica', 'AG': 'Antigua and Barbuda', 'AR': 'Argentina', 'AM': 'Armenia', 'AW': 'Aruba', 'AU': 'Australia', 'AT': 'Austria', 'AZ': 'Azerbaijan', 'BS': 'Bahamas (the)', 'BH': 'Bahrain', 'BD': 'Bangladesh', 'BB': 'Barbados', 'BY': 'Belarus', 'BE': 'Belgium', 'BZ': 'Belize', 'BJ': 'Benin', 'BM': 'Bermuda', 'BT': 'Bhutan', 'BO': 'Bolivia (Plurinational State of)', 'BQ': 'Bonaire, Sint Eustatius and Saba', 'BA': 'Bosnia and Herzegovina', 'BW': 'Botswana', 'BV': 'Bouvet Island', 'BR': 'Brazil', 'IO': 'British Indian Ocean Territory (the)', 'BN': 'Brunei Darussalam', 'BG': 'Bulgaria', 'BF': 'Burkina Faso', 'BI': 'Burundi', 'CV': 'Cabo Verde', 'KH': 'Cambodia', 'CM': 'Cameroon', 'CA': 'Canada', 'KY': 'Cayman Islands (the)', 'CF': 'Central African Republic (the)', 'TD': 'Chad', 'CL': 'Chile', 'CN': 'China', 'CX': 'Christmas Island', 'CC': 'Cocos (Keeling) Islands (the)', 'CO': 'Colombia', 'KM': 'Comoros (the)', 'CD': 'Congo (the Democratic Republic of the)', 'CG': 'Congo (the)', 'CK': 'Cook Islands (the)', 'CR': 'Costa Rica', 'CI': "Côte d'Ivoire", 'HR': 'Croatia', 'CU': 'Cuba', 'CW': 'Curaçao', 'CY': 'Cyprus', 'CZ': 'Czechia', 'DK': 'Denmark', 'DJ': 'Djibouti', 'DM': 'Dominica', 'DO': 'Dominican Republic (the)', 'EC': 'Ecuador', 'EG': 'Egypt', 'SV': 'El Salvador', 'GQ': 'Equatorial Guinea', 'ER': 'Eritrea', 'EE': 'Estonia', 'SZ': 'Eswatini', 'ET': 'Ethiopia', 'FK': 'Falkland Islands (the) [Malvinas]', 'FO': 'Faroe Islands (the)', 'FJ': 'Fiji', 'FI': 'Finland', 'FR': 'France', 'GF': 'French Guiana', 'PF': 'French Polynesia', 'TF': 'French Southern Territories (the)', 'GA': 'Gabon', 'GM': 'Gambia (the)', 'GE': 'Georgia', 'DE': 'Germany', 'GH': 'Ghana', 'GI': 'Gibraltar', 'GR': 'Greece', 'GL': 'Greenland', 'GD': 'Grenada', 'GP': 'Guadeloupe', 'GU': 'Guam', 'GT': 'Guatemala', 'GG': 'Guernsey', 'GN': 
'Guinea', 'GW': 'Guinea-Bissau', 'GY': 'Guyana', 'HT': 'Haiti', 'HM': 'Heard Island and McDonald Islands', 'VA': 'Holy See (the)', 'HN': 'Honduras', 'HK': 'Hong Kong', 'HU': 'Hungary', 'IS': 'Iceland', 'IN': 'India', 'ID': 'Indonesia', 'IR': 'Iran (Islamic Republic of)', 'IQ': 'Iraq', 'IE': 'Ireland', 'IM': 'Isle of Man', 'IL': 'Israel', 'IT': 'Italy', 'JM': 'Jamaica', 'JP': 'Japan', 'JE': 'Jersey', 'JO': 'Jordan', 'KZ': 'Kazakhstan', 'KE': 'Kenya', 'KI': 'Kiribati', 'KP': "Korea (the Democratic People's Republic of)", 'KR': 'Korea (the Republic of)', 'KW': 'Kuwait', 'KG': 'Kyrgyzstan', 'LA': "Lao People's Democratic Republic (the)", 'LV': 'Latvia', 'LB': 'Lebanon', 'LS': 'Lesotho', 'LR': 'Liberia', 'LY': 'Libya', 'LI': 'Liechtenstein', 'LT': 'Lithuania', 'LU': 'Luxembourg', 'MO': 'Macao', 'MK': 'Republic of North Macedonia', 'MG': 'Madagascar', 'MW': 'Malawi', 'MY': 'Malaysia', 'MV': 'Maldives', 'ML': 'Mali', 'MT': 'Malta', 'MH': 'Marshall Islands (the)', 'MQ': 'Martinique', 'MR': 'Mauritania', 'MU': 'Mauritius', 'YT': 'Mayotte', 'MX': 'Mexico', 'FM': 'Micronesia (Federated States of)', 'MD': 'Moldova (the Republic of)', 'MC': 'Monaco', 'MN': 'Mongolia', 'ME': 'Montenegro', 'MS': 'Montserrat', 'MA': 'Morocco', 'MZ': 'Mozambique', 'MM': 'Myanmar', 'NA': 'Namibia', 'NR': 'Nauru', 'NP': 'Nepal', 'NL': 'Netherlands (the)', 'NC': 'New Caledonia', 'NZ': 'New Zealand', 'NI': 'Nicaragua', 'NE': 'Niger (the)', 'NG': 'Nigeria', 'NU': 'Niue', 'NF': 'Norfolk Island', 'MP': 'Northern Mariana Islands (the)', 'NO': 'Norway', 'OM': 'Oman', 'PK': 'Pakistan', 'PW': 'Palau', 'PS': 'Palestine, State of', 'PA': 'Panama', 'PG': 'Papua New Guinea', 'PY': 'Paraguay', 'PE': 'Peru', 'PH': 'Philippines (the)', 'PN': 'Pitcairn', 'PL': 'Poland', 'PT': 'Portugal', 'PR': 'Puerto Rico', 'QA': 'Qatar', 'RE': 'Réunion', 'RO': 'Romania', 'RU': 'Russian Federation (the)', 'RW': 'Rwanda', 'BL': 'Saint Barthélemy', 'SH': 'Saint Helena, Ascension and Tristan da Cunha', 'KN': 'Saint Kitts and Nevis', 
'LC': 'Saint Lucia', 'MF': 'Saint Martin (French part)', 'PM': 'Saint Pierre and Miquelon', 'VC': 'Saint Vincent and the Grenadines', 'WS': 'Samoa', 'SM': 'San Marino', 'ST': 'Sao Tome and Principe', 'SA': 'Saudi Arabia', 'SN': 'Senegal', 'RS': 'Serbia', 'SC': 'Seychelles', 'SL': 'Sierra Leone', 'SG': 'Singapore', 'SX': 'Sint Maarten (Dutch part)', 'SK': 'Slovakia', 'SI': 'Slovenia', 'SB': 'Solomon Islands', 'SO': 'Somalia', 'ZA': 'South Africa', 'GS': 'South Georgia and the South Sandwich Islands', 'SS': 'South Sudan', 'ES': 'Spain', 'LK': 'Sri Lanka', 'SD': 'Sudan (the)', 'SR': 'Suriname', 'SJ': 'Svalbard and Jan Mayen', 'SE': 'Sweden', 'CH': 'Switzerland', 'SY': 'Syrian Arab Republic', 'TW': 'Taiwan (Province of China)', 'TJ': 'Tajikistan', 'TZ': 'Tanzania, United Republic of', 'TH': 'Thailand', 'TL': 'Timor-Leste', 'TG': 'Togo', 'TK': 'Tokelau', 'TO': 'Tonga', 'TT': 'Trinidad and Tobago', 'TN': 'Tunisia', 'TR': 'Turkey', 'TM': 'Turkmenistan', 'TC': 'Turks and Caicos Islands (the)', 'TV': 'Tuvalu', 'UG': 'Uganda', 'UA': 'Ukraine', 'AE': 'United Arab Emirates (the)', 'GB': 'United Kingdom of Great Britain and Northern Ireland (the)', 'UM': 'United States Minor Outlying Islands (the)', 'US': 'United States of America (the)', 'UY': 'Uruguay', 'UZ': 'Uzbekistan', 'VU': 'Vanuatu', 'VE': 'Venezuela (Bolivarian Republic of)', 'VN': 'Viet Nam', 'VG': 'Virgin Islands (British)', 'VI': 'Virgin Islands (U.S.)', 'WF': 'Wallis and Futuna', 'EH': 'Western Sahara', 'YE': 'Yemen', 'ZM': 'Zambia', 'ZW': 'Zimbabwe'}

In [15]:
# Make an independent copy, then rename its country column to country_code
songs_data_filtered_copy = songs_data_filtered.copy()
songs_data_filtered_copy.rename(columns={"country": "country_code"}, inplace=True)

# Confirm name change
songs_data_filtered_copy

Unnamed: 0,spotify_id,name,artists,daily_rank,daily_movement,weekly_movement,country_code,snapshot_date,popularity,is_explicit,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
1289450,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",1,0,0,ZA,2023-12-31,70,False,...,6,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4
1289451,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",2,0,0,ZA,2023-12-31,68,False,...,9,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4
1289452,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",3,0,0,ZA,2023-12-31,68,False,...,5,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4
1289453,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",4,0,0,ZA,2023-12-31,66,False,...,6,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4
1289454,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",5,0,1,ZA,2023-12-31,63,False,...,2,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1562670,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",46,4,0,AE,2023-10-18,84,True,...,5,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3
1562671,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",47,3,0,AE,2023-10-18,80,True,...,10,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4
1562672,26b3oVLrRUaaybJulow9kz,People,Libianca,48,2,0,AE,2023-10-18,88,False,...,10,-7.621,0,0.0678,0.5510,0.000013,0.1020,0.693,124.357,5
1562673,5ydjxBSUIDn26MFzU3asP4,Rainy Days,V,49,1,0,AE,2023-10-18,88,False,...,9,-8.016,0,0.0875,0.7390,0.000000,0.1480,0.282,74.828,4
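A note on working with a copy: in pandas, plain assignment only binds a second name to the same DataFrame, so an in-place operation such as `rename(..., inplace=True)` is visible through both names, while `.copy()` creates an independent object. The following minimal sketch shows the difference on a toy frame (`original`, `alias`, and `independent` are illustrative names):

```python
import pandas as pd

original = pd.DataFrame({"country": ["ZA", "AE"]})

alias = original                 # same object under a second name
independent = original.copy()    # separate object with its own data

# An in-place rename through the alias also changes `original`
alias.rename(columns={"country": "country_code"}, inplace=True)

print(list(original.columns))     # ['country_code']
print(list(independent.columns))  # ['country']
```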


In [16]:
# Duplicate country_code column
songs_data_filtered_copy['country'] = songs_data_filtered_copy['country_code']

# Map country codes to country names
songs_data_filtered_copy = songs_data_filtered_copy.replace({"country": country_dict})

# Confirm successful mapping
songs_data_filtered_copy

Unnamed: 0,spotify_id,name,artists,daily_rank,daily_movement,weekly_movement,country_code,snapshot_date,popularity,is_explicit,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,country
1289450,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",1,0,0,ZA,2023-12-31,70,False,...,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4,South Africa
1289451,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",2,0,0,ZA,2023-12-31,68,False,...,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4,South Africa
1289452,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",3,0,0,ZA,2023-12-31,68,False,...,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4,South Africa
1289453,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",4,0,0,ZA,2023-12-31,66,False,...,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4,South Africa
1289454,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",5,0,1,ZA,2023-12-31,63,False,...,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4,South Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1562670,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",46,4,0,AE,2023-10-18,84,True,...,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3,United Arab Emirates (the)
1562671,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",47,3,0,AE,2023-10-18,80,True,...,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4,United Arab Emirates (the)
1562672,26b3oVLrRUaaybJulow9kz,People,Libianca,48,2,0,AE,2023-10-18,88,False,...,-7.621,0,0.0678,0.5510,0.000013,0.1020,0.693,124.357,5,United Arab Emirates (the)
1562673,5ydjxBSUIDn26MFzU3asP4,Rainy Days,V,49,1,0,AE,2023-10-18,88,False,...,-8.016,0,0.0875,0.7390,0.000000,0.1480,0.282,74.828,4,United Arab Emirates (the)
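One thing to keep in mind is that `replace` leaves any code that is missing from the dictionary unchanged, so a gap in the mapping would pass silently. As a hedged alternative, `Series.map` returns `NaN` for unmapped codes, which makes them easy to list. The sketch below uses a two-entry excerpt of the dictionary and a made-up code `"XX"`; `code_to_name` and `toy` are illustrative names:

```python
import pandas as pd

# Small excerpt of a code-to-name dictionary (illustrative subset)
code_to_name = {"ZA": "South Africa", "AE": "United Arab Emirates (the)"}

# "XX" is a made-up code that is deliberately absent from the dictionary
toy = pd.DataFrame({"country_code": ["ZA", "AE", "XX"]})

# map() yields NaN for codes that have no entry in the dictionary
toy["country"] = toy["country_code"].map(code_to_name)

# List any codes that did not receive a name
unmapped = toy.loc[toy["country"].isna(), "country_code"].unique().tolist()
print(unmapped)  # ['XX']
```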


### Changing spelling of country names

To ensure that our Spotify data is compatible with our World Happiness data, let's change some country names.

In [17]:
songs_data_filtered_copy['country'].unique()

array(['South Africa', 'Viet Nam', 'Venezuela (Bolivarian Republic of)',
       'Uruguay', 'United States of America (the)', 'Ukraine',
       'Taiwan (Province of China)', 'Turkey', 'Thailand', 'El Salvador',
       'Slovakia', 'Singapore', 'Sweden', 'Saudi Arabia', 'Romania',
       'Paraguay', 'Portugal', 'Poland', 'Pakistan', 'Philippines (the)',
       'Peru', 'Panama', 'New Zealand', 'Norway', 'Netherlands (the)',
       'Nicaragua', 'Nigeria', 'Malaysia', 'Mexico', 'Morocco', 'Latvia',
       'Luxembourg', 'Lithuania', 'Kazakhstan', 'Korea (the Republic of)',
       'Japan', 'Italy', 'Iceland', 'India', 'Israel', 'Ireland',
       'Indonesia', 'Hungary', 'Honduras', 'Hong Kong', 'Guatemala',
       'Greece',
       'United Kingdom of Great Britain and Northern Ireland (the)',
       'France', 'Finland', 'Spain', 'Egypt', 'Estonia', 'Ecuador',
       'Dominican Republic (the)', 'Denmark', 'Germany', 'Czechia',
       'Costa Rica', 'Colombia', 'Chile', 'Switzerland', 'Canada',
   

In [18]:
name_changes = {'Bolivia (Plurinational State of)':'Bolivia',
 'Dominican Republic (the)':'Dominican Republic',
 'Korea (the Republic of)':'South Korea',
 'Netherlands (the)':'Netherlands',
 'Philippines (the)':'Philippines',
 'Taiwan (Province of China)':'Taiwan',
 'United Arab Emirates (the)':'United Arab Emirates',
 'United Kingdom of Great Britain and Northern Ireland (the)':'United Kingdom',
 'United States of America (the)':'United States',
 'Venezuela (Bolivarian Republic of)':'Venezuela',
 'Viet Nam':'Vietnam'}

In [19]:
# Change country names
songs_data_filtered_copy = songs_data_filtered_copy.replace({"country": name_changes})

In [20]:
# Confirm data manipulation successful
sorted(songs_data_filtered_copy['country'].unique())

['Argentina',
 'Australia',
 'Austria',
 'Belarus',
 'Belgium',
 'Bolivia',
 'Brazil',
 'Bulgaria',
 'Canada',
 'Chile',
 'Colombia',
 'Costa Rica',
 'Czechia',
 'Denmark',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Estonia',
 'Finland',
 'France',
 'Germany',
 'Greece',
 'Guatemala',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Ireland',
 'Israel',
 'Italy',
 'Japan',
 'Kazakhstan',
 'Latvia',
 'Lithuania',
 'Luxembourg',
 'Malaysia',
 'Mexico',
 'Morocco',
 'Netherlands',
 'New Zealand',
 'Nicaragua',
 'Nigeria',
 'Norway',
 'Pakistan',
 'Panama',
 'Paraguay',
 'Peru',
 'Philippines',
 'Poland',
 'Portugal',
 'Romania',
 'Saudi Arabia',
 'Singapore',
 'Slovakia',
 'South Africa',
 'South Korea',
 'Spain',
 'Sweden',
 'Switzerland',
 'Taiwan',
 'Thailand',
 'Turkey',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom',
 'United States',
 'Uruguay',
 'Venezuela',
 'Vietnam']
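Before merging the two data sets, it can help to check which country names still fail to match. The sketch below compares two sets of names. Both sets are made-up placeholders, in particular `happiness_countries`, whose spellings are assumptions and not taken from the World Happiness file:

```python
# Illustrative subset of the country names in the Spotify data
spotify_countries = {"Vietnam", "South Korea", "Turkey", "Hong Kong"}

# Hypothetical names from the other data set; real spellings must be checked
happiness_countries = {"Vietnam", "South Korea", "Turkiye"}

# Names present in the Spotify data but absent from the other set
unmatched = sorted(spotify_countries - happiness_countries)
print(unmatched)  # ['Hong Kong', 'Turkey']
```

Any name that shows up in `unmatched` would need either another entry in `name_changes` or a decision to drop that country.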

In [21]:
# Data manipulation successful, so save back to songs_data_filtered.
songs_data_filtered = songs_data_filtered_copy

# Confirm
songs_data_filtered

Unnamed: 0,spotify_id,name,artists,daily_rank,daily_movement,weekly_movement,country_code,snapshot_date,popularity,is_explicit,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,country
1289450,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",1,0,0,ZA,2023-12-31,70,False,...,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4,South Africa
1289451,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",2,0,0,ZA,2023-12-31,68,False,...,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4,South Africa
1289452,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",3,0,0,ZA,2023-12-31,68,False,...,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4,South Africa
1289453,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",4,0,0,ZA,2023-12-31,66,False,...,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4,South Africa
1289454,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",5,0,1,ZA,2023-12-31,63,False,...,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4,South Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1562670,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",46,4,0,AE,2023-10-18,84,True,...,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3,United Arab Emirates
1562671,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",47,3,0,AE,2023-10-18,80,True,...,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4,United Arab Emirates
1562672,26b3oVLrRUaaybJulow9kz,People,Libianca,48,2,0,AE,2023-10-18,88,False,...,-7.621,0,0.0678,0.5510,0.000013,0.1020,0.693,124.357,5,United Arab Emirates
1562673,5ydjxBSUIDn26MFzU3asP4,Rainy Days,V,49,1,0,AE,2023-10-18,88,False,...,-8.016,0,0.0875,0.7390,0.000000,0.1480,0.282,74.828,4,United Arab Emirates


### Dropping columns

Since we aren't interested in `daily_rank`, `daily_movement`, `weekly_movement`, or `snapshot_date`, let's drop these columns:

In [22]:
drop_cols = ["daily_rank", "daily_movement", "weekly_movement", "snapshot_date"]
songs_data_filtered_df = songs_data_filtered.drop(drop_cols, axis=1)

# Confirm
songs_data_filtered_df.sort_values(by=["country", "name"])

Unnamed: 0,spotify_id,name,artists,country_code,popularity,is_explicit,duration_ms,album_name,album_release_date,danceability,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,country
1456955,2wjzgkfw4MqYOtPAnREHoL,24/7 6.5,"YSY A, Jere Klein, ONIRIA",AR,63,False,137559,EL AFTER DEL AFTER,2023-11-11,0.925,...,-5.850,1,0.2450,0.187,0.000113,0.0928,0.639,129.985,4,Argentina
1460595,2wjzgkfw4MqYOtPAnREHoL,24/7 6.5,"YSY A, Jere Klein, ONIRIA",AR,60,False,137559,EL AFTER DEL AFTER,2023-11-11,0.925,...,-5.850,1,0.2450,0.187,0.000113,0.0928,0.639,129.985,4,Argentina
1292994,2bNCiY24Eh4saMcc23bvUN,ADIÓS,Maria Becerra,AR,77,True,160856,LA NENA DE ARGENTINA,2022-12-08,0.747,...,-4.184,0,0.0547,0.374,0.000005,0.2020,0.514,98.012,4,Argentina
1296640,2bNCiY24Eh4saMcc23bvUN,ADIÓS,Maria Becerra,AR,77,True,160856,LA NENA DE ARGENTINA,2022-12-08,0.747,...,-4.184,0,0.0547,0.374,0.000005,0.2020,0.514,98.012,4,Argentina
1300285,2bNCiY24Eh4saMcc23bvUN,ADIÓS,Maria Becerra,AR,77,True,160856,LA NENA DE ARGENTINA,2022-12-08,0.747,...,-4.184,0,0.0547,0.374,0.000005,0.2020,0.514,98.012,4,Argentina
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1544594,08ULi904W2Po6pVj8nN7KC,đưa em về nhàa,"GREY D, Chillies",VN,67,False,240000,đưa em về nhàa,2023-04-24,0.664,...,-10.227,1,0.0391,0.674,0.000015,0.3530,0.644,90.960,4,Vietnam
1548248,08ULi904W2Po6pVj8nN7KC,đưa em về nhàa,"GREY D, Chillies",VN,66,False,240000,đưa em về nhàa,2023-04-24,0.664,...,-10.227,1,0.0391,0.674,0.000015,0.3530,0.644,90.960,4,Vietnam
1551894,08ULi904W2Po6pVj8nN7KC,đưa em về nhàa,"GREY D, Chillies",VN,66,False,240000,đưa em về nhàa,2023-04-24,0.664,...,-10.227,1,0.0391,0.674,0.000015,0.3530,0.644,90.960,4,Vietnam
1555528,08ULi904W2Po6pVj8nN7KC,đưa em về nhàa,"GREY D, Chillies",VN,66,False,240000,đưa em về nhàa,2023-04-24,0.664,...,-10.227,1,0.0391,0.674,0.000015,0.3530,0.644,90.960,4,Vietnam


### Remove duplicate songs

Note that the same song can appear several times for a given country in our Spotify data, once per snapshot. These duplicates still contribute valuable data, but a song's popularity may vary from one snapshot to the next, so let's average the popularity for each country and song, remove the duplicate rows, and then merge the averaged popularity back in.

In [23]:
songs_data_filtered_df.columns

Index(['spotify_id', 'name', 'artists', 'country_code', 'popularity',
       'is_explicit', 'duration_ms', 'album_name', 'album_release_date',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'time_signature', 'country'],
      dtype='object')

In [24]:
# Create a copy of the dataframe before manipulating
songs_data_filtered_df_copy = songs_data_filtered_df.copy()

In [25]:
# Drop duplicates in copy and count the rows
songs_data_filtered_df_copy = songs_data_filtered_df_copy.drop_duplicates(subset=['country', 'spotify_id'])
songs_data_filtered_df_copy

Unnamed: 0,spotify_id,name,artists,country_code,popularity,is_explicit,duration_ms,album_name,album_release_date,danceability,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,country
1289450,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",ZA,70,False,351200,Isimo,2023-10-27,0.806,...,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4,South Africa
1289451,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",ZA,68,False,328096,Makhelwane,2023-11-17,0.867,...,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4,South Africa
1289452,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",ZA,68,False,410847,Permanent Music 3,2023-09-15,0.697,...,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4,South Africa
1289453,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",ZA,66,False,427217,Masithokoze,2023-10-20,0.603,...,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4,South Africa
1289454,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",ZA,63,False,373821,Funk 55,2023-12-01,0.895,...,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4,South Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1558882,5W4kiM2cUYBJXKRudNyxjW,You Proof,Morgan Wallen,AU,85,False,157477,One Thing At A Time,2023-03-03,0.732,...,-5.007,1,0.0345,0.2650,0.000000,0.6020,0.629,119.724,4,Australia
1558982,4qSEvFGCpde73gqIuq3sho,HIBIKI,"Bad Bunny, Mora",AR,89,True,208000,nadie sabe lo que va a pasar mañana,2023-10-13,0.801,...,-5.605,0,0.0706,0.6040,0.000000,0.1180,0.528,119.935,4,Argentina
1559029,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",AE,84,True,310490,For All The Dogs,2023-10-06,0.483,...,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3,United Arab Emirates
1559030,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",AE,80,True,173253,SET IT OFF,2023-10-13,0.773,...,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4,United Arab Emirates


In [26]:
gk = songs_data_filtered_df.groupby(['country', 'spotify_id'])['popularity'].mean()
gk = gk.reset_index()

In [27]:
gk

Unnamed: 0,country,spotify_id,popularity
0,Argentina,045ZeOHPIzhxxsm8bq5kyE,0.000000
1,Argentina,04sktg3deiYUweHfbFUZTM,58.760000
2,Argentina,0H9WU0OIXPpbOVgzzOanXb,82.573333
3,Argentina,0J9g1MMJDhyvOb3NWckHMm,90.500000
4,Argentina,0Me3GyNuLOa1YTIxhJPyCn,82.200000
...,...,...,...
11464,Vietnam,7pgbDdy7ax962o9d2xJceV,87.000000
11465,Vietnam,7tFwBnuaGXqiiONukPRaCo,67.306667
11466,Vietnam,7td8DTWoGC9u9db37mGHX6,64.000000
11467,Vietnam,7uoFMmxln0GPXQ0AcCBXRq,94.000000


Let's remove the `popularity` column from the `songs_data_filtered_df_copy` dataframe, since the averaged values in `gk` will replace it.

In [28]:
songs_data_filtered_df_copy = songs_data_filtered_df_copy.drop(['popularity'], axis=1)

In [29]:
# Confirm drop
songs_data_filtered_df_copy

Unnamed: 0,spotify_id,name,artists,country_code,is_explicit,duration_ms,album_name,album_release_date,danceability,energy,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,country
1289450,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",ZA,False,351200,Isimo,2023-10-27,0.806,0.767,...,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4,South Africa
1289451,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",ZA,False,328096,Makhelwane,2023-11-17,0.867,0.450,...,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4,South Africa
1289452,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",ZA,False,410847,Permanent Music 3,2023-09-15,0.697,0.643,...,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4,South Africa
1289453,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",ZA,False,427217,Masithokoze,2023-10-20,0.603,0.501,...,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4,South Africa
1289454,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",ZA,False,373821,Funk 55,2023-12-01,0.895,0.438,...,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4,South Africa
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1558882,5W4kiM2cUYBJXKRudNyxjW,You Proof,Morgan Wallen,AU,False,157477,One Thing At A Time,2023-03-03,0.732,0.839,...,-5.007,1,0.0345,0.2650,0.000000,0.6020,0.629,119.724,4,Australia
1558982,4qSEvFGCpde73gqIuq3sho,HIBIKI,"Bad Bunny, Mora",AR,True,208000,nadie sabe lo que va a pasar mañana,2023-10-13,0.801,0.645,...,-5.605,0,0.0706,0.6040,0.000000,0.1180,0.528,119.935,4,Argentina
1559029,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",AE,True,310490,For All The Dogs,2023-10-06,0.483,0.408,...,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3,United Arab Emirates
1559030,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",AE,True,173253,SET IT OFF,2023-10-13,0.773,0.635,...,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4,United Arab Emirates


In [30]:
# Left-join gk to songs_data_filtered_df_copy on country and spotify_id
new_df = pd.merge(songs_data_filtered_df_copy, gk, how='left', on=['country', 'spotify_id'])

In [31]:
# Rearrange the columns for better readability
new_df = new_df[['country', 'spotify_id', 'name', 'artists', 'country_code', 'popularity', 'is_explicit',
       'duration_ms', 'album_name', 'album_release_date', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature']]
new_df

Unnamed: 0,country,spotify_id,name,artists,country_code,popularity,is_explicit,duration_ms,album_name,album_release_date,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,South Africa,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",ZA,63.603175,False,351200,Isimo,2023-10-27,...,6,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4
1,South Africa,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",ZA,62.833333,False,328096,Makhelwane,2023-11-17,...,9,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4
2,South Africa,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",ZA,68.324324,False,410847,Permanent Music 3,2023-09-15,...,5,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4
3,South Africa,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",ZA,60.397059,False,427217,Masithokoze,2023-10-20,...,6,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4
4,South Africa,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",ZA,54.344828,False,373821,Funk 55,2023-12-01,...,2,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11464,Australia,5W4kiM2cUYBJXKRudNyxjW,You Proof,Morgan Wallen,AU,85.000000,False,157477,One Thing At A Time,2023-03-03,...,9,-5.007,1,0.0345,0.2650,0.000000,0.6020,0.629,119.724,4
11465,Argentina,4qSEvFGCpde73gqIuq3sho,HIBIKI,"Bad Bunny, Mora",AR,88.500000,True,208000,nadie sabe lo que va a pasar mañana,2023-10-13,...,6,-5.605,0,0.0706,0.6040,0.000000,0.1180,0.528,119.935,4
11466,United Arab Emirates,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",AE,84.000000,True,310490,For All The Dogs,2023-10-06,...,5,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3
11467,United Arab Emirates,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",AE,80.000000,True,173253,SET IT OFF,2023-10-13,...,10,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4


In [32]:
# Reassign new_df back to songs_data_filtered_df
songs_data_filtered_df = new_df

In [33]:
# Confirm
songs_data_filtered_df

Unnamed: 0,country,spotify_id,name,artists,country_code,popularity,is_explicit,duration_ms,album_name,album_release_date,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,South Africa,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",ZA,63.603175,False,351200,Isimo,2023-10-27,...,6,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4
1,South Africa,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",ZA,62.833333,False,328096,Makhelwane,2023-11-17,...,9,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4
2,South Africa,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",ZA,68.324324,False,410847,Permanent Music 3,2023-09-15,...,5,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4
3,South Africa,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",ZA,60.397059,False,427217,Masithokoze,2023-10-20,...,6,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4
4,South Africa,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",ZA,54.344828,False,373821,Funk 55,2023-12-01,...,2,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11464,Australia,5W4kiM2cUYBJXKRudNyxjW,You Proof,Morgan Wallen,AU,85.000000,False,157477,One Thing At A Time,2023-03-03,...,9,-5.007,1,0.0345,0.2650,0.000000,0.6020,0.629,119.724,4
11465,Argentina,4qSEvFGCpde73gqIuq3sho,HIBIKI,"Bad Bunny, Mora",AR,88.500000,True,208000,nadie sabe lo que va a pasar mañana,2023-10-13,...,6,-5.605,0,0.0706,0.6040,0.000000,0.1180,0.528,119.935,4
11466,United Arab Emirates,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",AE,84.000000,True,310490,For All The Dogs,2023-10-06,...,5,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3
11467,United Arab Emirates,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",AE,80.000000,True,173253,SET IT OFF,2023-10-13,...,10,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4


---

## Data Wrangling: World Happiness Data

In [34]:
# Load world happiness data
wh_data = pd.read_csv('./wh_data.csv')

In [35]:
# Check data summary
wh_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 11 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   Country name                                143 non-null    object 
 1   Ladder score                                143 non-null    float64
 2   upperwhisker                                143 non-null    float64
 3   lowerwhisker                                143 non-null    float64
 4   Explained by: Log GDP per capita            140 non-null    float64
 5   Explained by: Social support                140 non-null    float64
 6   Explained by: Healthy life expectancy       140 non-null    float64
 7   Explained by: Freedom to make life choices  140 non-null    float64
 8   Explained by: Generosity                    140 non-null    float64
 9   Explained by: Perceptions of corruption     140 non-null    float64
 10  Dystopia + res

### Rename columns

In [36]:
cols_to_rename = {
    'Country name':'country', 
    'Ladder score':'ladder_score',
    'Explained by: Log GDP per capita':'eb_log_gdp', 
    'Explained by: Social support':'eb_social_support',
    'Explained by: Healthy life expectancy':'eb_life_expectancy',
    'Explained by: Freedom to make life choices':'eb_life_choice_freedom',
    'Explained by: Generosity':'eb_generosity', 
    'Explained by: Perceptions of corruption':'eb_corruption',
    'Dystopia + residual':'dystopia_residual'
}

In [37]:
wh_data = wh_data.rename(columns=cols_to_rename)

### Check for missing data

In [38]:
# Check for missing data
missing = pd.concat([wh_data.isnull().sum(), 100 * wh_data.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count')

Unnamed: 0,count,%
country,0,0.0
ladder_score,0,0.0
upperwhisker,0,0.0
lowerwhisker,0,0.0
eb_log_gdp,3,2.097902
eb_social_support,3,2.097902
eb_life_expectancy,3,2.097902
eb_life_choice_freedom,3,2.097902
eb_generosity,3,2.097902
eb_corruption,3,2.097902
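
To see which countries account for those missing values, the rows containing any null can be selected directly. The following is a minimal sketch on toy data; the frame `toy_wh` and its values are made up for illustration and are not taken from `wh_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for wh_data; countries and values are illustrative only
toy_wh = pd.DataFrame({
    "country": ["A", "B", "C"],
    "ladder_score": [7.1, 5.2, 4.3],
    "eb_log_gdp": [1.8, np.nan, 1.1],
    "eb_generosity": [0.2, np.nan, 0.1],
})

# Keep only the rows where at least one column is null
rows_with_missing = toy_wh[toy_wh.isnull().any(axis=1)]
missing_countries = rows_with_missing["country"].tolist()
print(missing_countries)
```

Applied to the real dataframe, the same mask would show whether the three missing values in each `eb_` column belong to the same three countries.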


### Check which countries are in the data set

In [39]:
# Check which countries are in the data set
countries_array = wh_data['country'].unique()
sorted(countries_array)

['Afghanistan',
 'Albania',
 'Algeria',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahrain',
 'Bangladesh',
 'Belgium',
 'Benin',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Bulgaria',
 'Burkina Faso',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 'Croatia',
 'Cyprus',
 'Czechia',
 'Denmark',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Guinea',
 'Honduras',
 'Hong Kong S.A.R. of China',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Ivory Coast',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kosovo',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Lithuania',
 'Luxem

In [40]:
# Copy wh_data for manipulation (.copy() returns an independent dataframe, not an alias)
wh_data_copy = wh_data.copy()

# 'Country name' was already renamed to 'country' above, so no further rename is needed

# Confirm column names
wh_data_copy.head()

Unnamed: 0,country,ladder_score,upperwhisker,lowerwhisker,eb_log_gdp,eb_social_support,eb_life_expectancy,eb_life_choice_freedom,eb_generosity,eb_corruption,dystopia_residual
0,Finland,7.741,7.815,7.667,1.844,1.572,0.695,0.859,0.142,0.546,2.082
1,Denmark,7.583,7.665,7.5,1.908,1.52,0.699,0.823,0.204,0.548,1.881
2,Iceland,7.525,7.618,7.433,1.881,1.617,0.718,0.819,0.258,0.182,2.05
3,Sweden,7.344,7.422,7.267,1.878,1.501,0.724,0.838,0.221,0.524,1.658
4,Israel,7.341,7.405,7.277,1.803,1.513,0.74,0.641,0.153,0.193,2.298
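
A note on copying dataframes: plain assignment binds a second name to the same object, whereas `.copy()` creates an independent dataframe. The sketch below illustrates the difference on a toy frame; the variable names `original`, `alias`, and `independent` are hypothetical.

```python
import pandas as pd

original = pd.DataFrame({"country": ["Finland", "Denmark"]})

alias = original               # same object under a second name
independent = original.copy()  # separate dataframe with its own data

# Modify the original in place by adding a column
original["ladder_score"] = [7.741, 7.583]

print(alias is original)                       # the alias is the same object
print("ladder_score" in alias.columns)         # so it sees the new column
print("ladder_score" in independent.columns)   # the copy does not
```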


### Change country names

There are a few countries we identified whose names do not match those that are in the Spotify dataset, so let's change them.
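
One way such mismatches could be surfaced is a set difference between the two country columns. This is a sketch on toy data, not necessarily how the mismatches were originally found; `spotify_toy` and `happiness_toy` are hypothetical stand-ins for the two datasets.

```python
import pandas as pd

# Toy stand-ins for the two datasets
spotify_toy = pd.DataFrame({"country": ["Hong Kong", "Taiwan", "Finland"]})
happiness_toy = pd.DataFrame({
    "country": ["Hong Kong S.A.R. of China", "Taiwan Province of China", "Finland"]
})

# Names present in one dataset but not in the other
only_in_spotify = sorted(set(spotify_toy["country"]) - set(happiness_toy["country"]))
only_in_happiness = sorted(set(happiness_toy["country"]) - set(spotify_toy["country"]))

print(only_in_spotify)
print(only_in_happiness)
```

Comparing the two lists side by side makes it easy to spot pairs that refer to the same place under different names.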

In [41]:
# Change some country names
# Create a dictionary with desired name changes
name_changes = {'Hong Kong S.A.R. of China':'Hong Kong', 'Taiwan Province of China':'Taiwan', 'Turkiye':'Turkey'}

# Use replace method to change names
wh_data_copy = wh_data_copy.replace({"country": name_changes})

# Confirm name changes
sorted(wh_data_copy["country"].unique())

['Afghanistan',
 'Albania',
 'Algeria',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahrain',
 'Bangladesh',
 'Belgium',
 'Benin',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Bulgaria',
 'Burkina Faso',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo (Brazzaville)',
 'Congo (Kinshasa)',
 'Costa Rica',
 'Croatia',
 'Cyprus',
 'Czechia',
 'Denmark',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Estonia',
 'Eswatini',
 'Ethiopia',
 'Finland',
 'France',
 'Gabon',
 'Gambia',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Guatemala',
 'Guinea',
 'Honduras',
 'Hong Kong',
 'Hungary',
 'Iceland',
 'India',
 'Indonesia',
 'Iran',
 'Iraq',
 'Ireland',
 'Israel',
 'Italy',
 'Ivory Coast',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kosovo',
 'Kuwait',
 'Kyrgyzstan',
 'Laos',
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Libya',
 'Lithuania',
 'Luxembourg',
 'Madaga

In [42]:
# Save copy back to original
wh_data = wh_data_copy

# Confirm
wh_data

Unnamed: 0,country,ladder_score,upperwhisker,lowerwhisker,eb_log_gdp,eb_social_support,eb_life_expectancy,eb_life_choice_freedom,eb_generosity,eb_corruption,dystopia_residual
0,Finland,7.741,7.815,7.667,1.844,1.572,0.695,0.859,0.142,0.546,2.082
1,Denmark,7.583,7.665,7.500,1.908,1.520,0.699,0.823,0.204,0.548,1.881
2,Iceland,7.525,7.618,7.433,1.881,1.617,0.718,0.819,0.258,0.182,2.050
3,Sweden,7.344,7.422,7.267,1.878,1.501,0.724,0.838,0.221,0.524,1.658
4,Israel,7.341,7.405,7.277,1.803,1.513,0.740,0.641,0.153,0.193,2.298
...,...,...,...,...,...,...,...,...,...,...,...
138,Congo (Kinshasa),3.295,3.462,3.128,0.534,0.665,0.262,0.473,0.189,0.072,1.102
139,Sierra Leone,3.245,3.366,3.124,0.654,0.566,0.253,0.469,0.181,0.053,1.068
140,Lesotho,3.186,3.469,2.904,0.771,0.851,0.000,0.523,0.082,0.085,0.875
141,Lebanon,2.707,2.797,2.616,1.377,0.577,0.556,0.173,0.068,0.029,-0.073


### Filter countries in World Happiness dataset

Let's filter `wh_data` to only include the countries that are in `songs_data_filtered_df`.

In [43]:
# Compare the number of unique countries in wh_data vs. songs_data_filtered_df.
print(f"World Happiness country count: {len(wh_data['country'].unique())}\nSpotify data country count: {len(songs_data_filtered_df['country'].unique())}")

World Happiness country count: 143
Spotify data country count: 72


In [44]:
# Build a sorted list of the countries in songs_data_filtered_df and use it to filter wh_data

# Create the sorted list of unique country names
spotify_countries = sorted(songs_data_filtered_df["country"].unique())

# Filter wh_data with a boolean mask
filtered_wh_data = wh_data[wh_data["country"].isin(spotify_countries)]

# Check head of new dataframe
filtered_wh_data.head()

Unnamed: 0,country,ladder_score,upperwhisker,lowerwhisker,eb_log_gdp,eb_social_support,eb_life_expectancy,eb_life_choice_freedom,eb_generosity,eb_corruption,dystopia_residual
0,Finland,7.741,7.815,7.667,1.844,1.572,0.695,0.859,0.142,0.546,2.082
1,Denmark,7.583,7.665,7.5,1.908,1.52,0.699,0.823,0.204,0.548,1.881
2,Iceland,7.525,7.618,7.433,1.881,1.617,0.718,0.819,0.258,0.182,2.05
3,Sweden,7.344,7.422,7.267,1.878,1.501,0.724,0.838,0.221,0.524,1.658
4,Israel,7.341,7.405,7.277,1.803,1.513,0.74,0.641,0.153,0.193,2.298


In [45]:
len(spotify_countries)

72

We expect 72 countries, so let's check the size of our dataframe to see if we have 72 rows.

In [46]:
len(filtered_wh_data)

71

We are missing only one country. Let's check to see which one it is:

In [47]:
missing_country = [x for x in spotify_countries if x not in filtered_wh_data['country'].unique()]
missing_country

['Belarus']
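
The same check could also be done in both directions at once with an outer merge and `indicator=True`, which labels each row by the side it came from. This is an alternative sketch on toy data, with hypothetical frames `spotify_toy` and `happiness_toy`.

```python
import pandas as pd

spotify_toy = pd.DataFrame({"country": ["Belarus", "Finland", "Denmark"]})
happiness_toy = pd.DataFrame({"country": ["Finland", "Denmark", "Iceland"]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = spotify_toy.merge(happiness_toy, on="country", how="outer", indicator=True)

only_in_spotify = merged.loc[merged["_merge"] == "left_only", "country"].tolist()
only_in_happiness = merged.loc[merged["_merge"] == "right_only", "country"].tolist()

print(only_in_spotify)
print(only_in_happiness)
```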

Let's drop Belarus from the original Spotify data.

In [48]:
# Create a new dataframe without Belarus (.copy() keeps it independent of the original)
songs_data_filtered_df_copy = songs_data_filtered_df[songs_data_filtered_df['country'] != 'Belarus'].copy()

In [49]:
songs_data_filtered_df_copy

Unnamed: 0,country,spotify_id,name,artists,country_code,popularity,is_explicit,duration_ms,album_name,album_release_date,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,South Africa,6Kijtp0DB6VwcoJIw7PJ9W,"Imithandazo (feat. Young Stunna, DJ Maphorisa,...","Kabza De Small, Mthunzi, DJ Maphorisa, Young S...",ZA,63.603175,False,351200,Isimo,2023-10-27,...,6,-9.686,0,0.1120,0.1790,0.001260,0.1820,0.795,113.001,4
1,South Africa,0UBK6HcgmmWUQzFQTncmDz,Keneilwe (feat. Dalom Kids),"Wanitwa Mos, Nkosazana Daughter, Master KG, Da...",ZA,62.833333,False,328096,Makhelwane,2023-11-17,...,9,-8.306,1,0.0467,0.0102,0.001240,0.0228,0.589,113.015,4
2,South Africa,5yyYL1FpimADTIftYQU0cg,iPlan,"Dlala Thukzin, Zaba, Sykes",ZA,68.324324,False,410847,Permanent Music 3,2023-09-15,...,5,-10.253,0,0.0384,0.0709,0.584000,0.1010,0.748,118.004,4
3,South Africa,4tsVMjM60RNTe9EV5oQ4sQ,Masithokoze,"DJ Stokie, Eemoh",ZA,60.397059,False,427217,Masithokoze,2023-10-20,...,6,-14.253,0,0.0617,0.0224,0.155000,0.0154,0.355,113.006,4
4,South Africa,5CD3ImPPdCW8QFQkjo5TXt,Funk 55,"Shakes & Les, DBN Gogo, Zee Nxumalo, Ceeka RSA...",ZA,54.344828,False,373821,Funk 55,2023-12-01,...,2,-13.307,0,0.0830,0.0230,0.000359,0.0425,0.508,113.047,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11464,Australia,5W4kiM2cUYBJXKRudNyxjW,You Proof,Morgan Wallen,AU,85.000000,False,157477,One Thing At A Time,2023-03-03,...,9,-5.007,1,0.0345,0.2650,0.000000,0.6020,0.629,119.724,4
11465,Argentina,4qSEvFGCpde73gqIuq3sho,HIBIKI,"Bad Bunny, Mora",AR,88.500000,True,208000,nadie sabe lo que va a pasar mañana,2023-10-13,...,6,-5.605,0,0.0706,0.6040,0.000000,0.1180,0.528,119.935,4
11466,United Arab Emirates,0AYt6NMyyLd0rLuvr0UkMH,Slime You Out (feat. SZA),"Drake, SZA",AE,84.000000,True,310490,For All The Dogs,2023-10-06,...,5,-9.243,0,0.0502,0.5080,0.000000,0.2590,0.105,88.880,3
11467,United Arab Emirates,2Gk6fi0dqt91NKvlzGsmm7,SAY MY GRACE (feat. Travis Scott),"Offset, Travis Scott",AE,80.000000,True,173253,SET IT OFF,2023-10-13,...,10,-5.060,1,0.0452,0.0585,0.000000,0.1320,0.476,121.879,4


In [50]:
# Update the original dataframe
songs_data_filtered_df = songs_data_filtered_df_copy

Let's assign `filtered_wh_data` back to `wh_data` so that it only contains countries that are also in the Spotify data.

In [51]:
wh_data = filtered_wh_data

### Stats for songs data

Let's create dataframes to capture summary statistics by country and by region.

`is_explicit` and `mode` are binary variables, so we will summarize each one as the percentage of songs per country with a value of `True` (or `1`).
- `is_explicit` - `False` (`0`) indicates a song that is not explicit. `True` (`1`) indicates a song that is explicit.
- `mode` - `0` indicates a song in a minor key. `1` indicates a song in a major key.

We will also take song popularity into account: each musical feature is averaged per country with songs weighted by their popularity, so more popular songs contribute more to the average. This should better represent what people are actually listening to.

Furthermore, we will include statistics on musical variability and musical diversity.
- Musical variability is measured by the standard deviation of each feature. These variables have `_std` at the end of their names.
- Musical diversity is measured by counting the number of distinct non-empty "bins" of values, where the bin width is one quarter of the feature's standard deviation (`bin_width = std / 4`). These variables have `_diversity` at the end of their names.

In [52]:
# First, define weighted average function.
def weighted_average(group, value_col, weight_col='popularity'):
    return np.average(group[value_col], weights=group[weight_col])
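
To make the effect of popularity weighting concrete, here is a small worked example on toy data (the values are made up). With valences of 0.2 and 0.8 and popularities of 10 and 90, the weighted mean is (0.2 × 10 + 0.8 × 90) / 100 = 0.74, while the plain mean is 0.5.

```python
import numpy as np
import pandas as pd

# Two toy songs: the second is far more popular than the first
toy_group = pd.DataFrame({
    "valence": [0.2, 0.8],
    "popularity": [10, 90],
})

plain_mean = toy_group["valence"].mean()
weighted_mean = np.average(toy_group["valence"], weights=toy_group["popularity"])

print(plain_mean)     # treats both songs equally
print(weighted_mean)  # pulled toward the more popular song
```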

In [53]:
# Define diversity calculation
def calculate_diversity(x):
    # Use standard deviation to determine bin width
    bin_width = x.std() / 4  # Using 1/4 of std dev as bin width
    # Handle constant values (std of 0) and single-value groups (std is NaN)
    if pd.isna(bin_width) or bin_width == 0:
        return 1
    # Build bin edges that span the full range of values
    bins = np.arange(x.min(), x.max() + bin_width, bin_width)
    hist, _ = np.histogram(x, bins=bins)
    # Return number of non-empty bins
    return np.sum(hist > 0)
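
As an illustration of what the diversity metric captures, the sketch below applies the same binning logic to two toy series. The function is restated under the hypothetical name `diversity_sketch` only so that the block runs on its own. A series clustered at two values occupies few bins, while an evenly spread series over the same range occupies more.

```python
import numpy as np
import pandas as pd

def diversity_sketch(x):
    # Same binning idea as above: bin width is a quarter of the standard deviation
    bin_width = x.std() / 4
    if pd.isna(bin_width) or bin_width == 0:
        return 1
    bins = np.arange(x.min(), x.max() + bin_width, bin_width)
    hist, _ = np.histogram(x, bins=bins)
    # Count the bins that contain at least one value
    return int(np.sum(hist > 0))

# Values clustered at the two extremes of the range
clustered = pd.Series([0, 0, 0, 0, 4, 4, 4, 4], dtype=float)
# Values spread evenly across the same range
spread = pd.Series(np.linspace(0, 4, 9))

clustered_diversity = diversity_sketch(clustered)
spread_diversity = diversity_sketch(spread)
print(clustered_diversity, spread_diversity)
```

Note that the clustered series has the larger standard deviation of the two, yet the lower diversity count, which is why the two statistics are reported separately.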

In [54]:
# List of features for which we will calculate weighted statistics
audio_features = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness',
                 'liveness', 'valence', 'tempo']

In [55]:
# Define a function calculate_percentage
def calculate_percentage(x):
    return x.mean() * 100
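
This works because the mean of a Boolean or 0/1 column is the fraction of `True` (or `1`) values. A quick check on toy data (values are made up):

```python
import pandas as pd

is_explicit_toy = pd.Series([True, False, True, True])  # 3 of 4 explicit
mode_toy = pd.Series([0, 1, 1, 0])                      # 2 of 4 in a major key

explicit_pct = is_explicit_toy.mean() * 100
major_pct = mode_toy.mean() * 100

print(explicit_pct)
print(major_pct)
```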

In [56]:
# Define a function for calculating weighted duration
def calculate_weighted_duration(x):
    return weighted_average(
        pd.DataFrame({
            'duration_ms': x, 
            'popularity': songs_data_filtered_df.loc[x.index, 'popularity']
        }), 
        'duration_ms'
    )

In [57]:
# Create initial statistics
initial_stats = songs_data_filtered_df.groupby('country').agg({
    'popularity': ['mean', 'std'],
    'is_explicit': calculate_percentage,
    'mode': calculate_percentage,
    'duration_ms': [
        lambda x: weighted_average(pd.DataFrame({
            'duration_ms': x, 
            'popularity': songs_data_filtered_df.loc[x.index, 'popularity']
        }), 'duration_ms'),
        'std'
    ]
})

In [58]:
# Flatten the column names
initial_stats.columns = [f"{col[0]}_{col[1]}" if isinstance(col, tuple) else col 
                        for col in initial_stats.columns]
songs_data_filtered_stats_country = initial_stats.reset_index()

# Rename columns for clarity
songs_data_filtered_stats_country = songs_data_filtered_stats_country.rename(columns={
    'popularity_mean': 'popularity',
    'is_explicit_calculate_percentage': 'is_explicit_pct',  # % of explicit songs
    'mode_calculate_percentage': 'mode_pct',                # % of songs in a major key
    'duration_ms_<lambda_0>': 'duration_ms'                 # popularity-weighted mean duration
})
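
The flattening step is needed because aggregating with a dictionary of lists produces two-level column labels of the form `(column, aggregation)`. A minimal sketch on toy data (the frame `toy` is hypothetical) shows the labels before and after flattening:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "popularity": [60, 80, 50, 70],
})

stats = toy.groupby("country").agg({"popularity": ["mean", "std"]})
print(list(stats.columns))  # tuples such as ('popularity', 'mean')

# Join each (column, aggregation) pair into a single flat name
stats.columns = [f"{col}_{agg}" for col, agg in stats.columns]
stats = stats.reset_index()
print(list(stats.columns))
```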

In [59]:
# Add weighted statistics and diversity metrics for audio features
for feature in audio_features:
    weighted_stats = songs_data_filtered_df.groupby('country').apply(
        lambda x: pd.Series({
            f'{feature}': weighted_average(x, feature),
            f'{feature}_std': x[feature].std(),
            f'{feature}_diversity': calculate_diversity(x[feature])
        })
    ).reset_index()
    
    songs_data_filtered_stats_country = songs_data_filtered_stats_country.merge(
        weighted_stats, on='country'
    )
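
As an alternative to looping and merging once per feature, all features could be summarized in a single `groupby` pass. The sketch below is an assumption-laden variant on toy data, not the approach used in this notebook; `toy`, `features`, and `summarize` are hypothetical names, and only the weighted mean and standard deviation are computed.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "popularity": [10, 90, 50, 50],
    "energy": [0.2, 0.8, 0.4, 0.6],
    "valence": [0.1, 0.5, 0.3, 0.9],
})
features = ["energy", "valence"]

def summarize(group):
    # Build one row of statistics covering every feature for this country
    out = {}
    for feature in features:
        out[feature] = np.average(group[feature], weights=group["popularity"])
        out[f"{feature}_std"] = group[feature].std()
    return pd.Series(out)

# Selecting the needed columns explicitly keeps the grouping column out of each group
stats = toy.groupby("country")[features + ["popularity"]].apply(summarize).reset_index()
print(stats)
```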



In [60]:
# Print information about the enhanced dataframe
print("Enhanced statistics dataframe structure:")
print("\nShape:", songs_data_filtered_stats_country.shape)
print("\nColumns:")
for col in sorted(songs_data_filtered_stats_country.columns):
    print(f"- {col}")

# Show diversity metrics summary
print("\nDiversity metrics summary:")
diversity_cols = [col for col in songs_data_filtered_stats_country.columns if 'diversity' in col]
print(songs_data_filtered_stats_country[diversity_cols].describe())

# Show key metrics
print("\nFirst few rows of key metrics:")
display_cols = ['country', 'popularity', 'is_explicit_pct', 'mode_pct', 
                'danceability', 'energy', 'valence']
print(songs_data_filtered_stats_country[display_cols].head())

# Verification check
print("\nVerification - checking for missing values:")
print(songs_data_filtered_stats_country.isnull().sum().sum())

Enhanced statistics dataframe structure:

Shape: (71, 34)

Columns:
- acousticness
- acousticness_diversity
- acousticness_std
- country
- danceability
- danceability_diversity
- danceability_std
- duration_ms
- duration_ms_std
- energy
- energy_diversity
- energy_std
- instrumentalness
- instrumentalness_diversity
- instrumentalness_std
- is_explicit_pct
- liveness
- liveness_diversity
- liveness_std
- loudness
- loudness_diversity
- loudness_std
- mode_pct
- popularity
- popularity_std
- speechiness
- speechiness_diversity
- speechiness_std
- tempo
- tempo_diversity
- tempo_std
- valence
- valence_diversity
- valence_std

Diversity metrics summary:
       danceability_diversity  energy_diversity  loudness_diversity  \
count               71.000000         71.000000           71.000000   
mean                18.943662         18.422535           19.957746   
std                  1.217578          1.420885            1.710778   
min                 16.000000         15.000000          

---

## Data quality checks

In [61]:
# Check for unexpected values
print("Unique values in time_signature:", songs_data_filtered_df['time_signature'].unique())
print("Value ranges for audio features:")
for feature in ['danceability', 'energy', 'valence']:
    print(f"{feature}: {songs_data_filtered_df[feature].min():.2f} to {songs_data_filtered_df[feature].max():.2f}")

Unique values in time_signature: [4, 3, 5, 1]
Categories (5, int64): [0, 1, 3, 4, 5]
Value ranges for audio features:
danceability: 0.12 to 0.98
energy: 0.01 to 1.00
valence: 0.03 to 0.99
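
Rather than reading the ranges by eye, the check could be automated. The sketch below assumes these three features are meant to lie between 0 and 1, as the printed ranges suggest; the frame `toy_songs` is made up and contains one deliberately invalid value.

```python
import pandas as pd

toy_songs = pd.DataFrame({
    "danceability": [0.12, 0.98, 0.55],
    "energy": [0.01, 1.00, 0.70],
    "valence": [0.03, 0.99, 1.20],  # 1.20 is deliberately out of range
})
unit_features = ["danceability", "energy", "valence"]

# True for each feature whose values all fall within [0, 1]
in_range = toy_songs[unit_features].apply(lambda s: s.between(0, 1).all())
out_of_range = in_range[~in_range].index.tolist()
print(out_of_range)
```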


In [62]:
print("Final dataset shapes:")
print(f"Spotify data: {songs_data_filtered_df.shape}")
print(f"World Happiness data: {wh_data.shape}")

Final dataset shapes:
Spotify data: (11263, 22)
World Happiness data: (71, 11)


## Save variables

In [63]:
# Save variables
%store wh_data
%store songs_data_filtered_df
%store songs_data_filtered_stats_country

Stored 'wh_data' (DataFrame)
Stored 'songs_data_filtered_df' (DataFrame)
Stored 'songs_data_filtered_stats_country' (DataFrame)
