# Data Science with Data Frames
In this notebook, we will use dataframes (a pandas object) to investigate data from two sources: Taylor Swift's discography and a database of world crime statistics. We will focus on descriptive statistics, which are quantities that summarize a dataset, like the mean and the max.

In [2]:
import numpy as np # alias: np
import pandas as pd # alias: pd
# shift + enter --> run the cell

## Taylor Swift
We will begin this chapter by analyzing a dataset from Spotify, which contains information about all of the tracks uploaded by Taylor Swift's Spotify account. The first step is to load our data, which we can do using pandas' read_csv function.

In [5]:
# .csv comma separated values
# https://www.kaggle.com/datasets/arthurboari/taylor-swift-spotify-data
df = pd.read_csv('taylor_swift_spotify_data.csv')

Now that we have the data read into our notebook, let's look at the data itself. We can do this by printing the column names, or even just checking the first few values of the data table.

In [6]:
# access the columns
columns = df.columns
print(columns)

Index(['artist_name', 'artist_id', 'album_id', 'album_type',
       'album_release_date', 'album_release_year',
       'album_release_date_precision', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'track_id', 'analysis_url',
       'time_signature', 'disc_number', 'duration_ms', 'explicit',
       'track_href', 'is_local', 'track_name', 'track_preview_url',
       'track_number', 'type', 'track_uri', 'external_urls.spotify',
       'album_name', 'key_name', 'mode_name', 'key_mode'],
      dtype='object')


In [10]:
df

Unnamed: 0,artist_name,artist_id,album_id,album_type,album_release_date,album_release_year,album_release_date_precision,danceability,energy,key,...,track_name,track_preview_url,track_number,type,track_uri,external_urls.spotify,album_name,key_name,mode_name,key_mode
0,Taylor Swift,06HL4z0CvFAxyc27GXpf02,3lS1y25WAhcqJDATJK70Mq,album,2022-10-22,2022,day,0.735,0.444,10,...,Lavender Haze,,1,track,spotify:track:4g2c7NoTWAOSYDy44l9nub,https://open.spotify.com/track/4g2c7NoTWAOSYDy...,Midnights (3am Edition),A#,major,A# major
1,Taylor Swift,06HL4z0CvFAxyc27GXpf02,3lS1y25WAhcqJDATJK70Mq,album,2022-10-22,2022,day,0.658,0.378,7,...,Maroon,,2,track,spotify:track:199E1RRrVmVTQqBXih5qRC,https://open.spotify.com/track/199E1RRrVmVTQqB...,Midnights (3am Edition),G,major,G major
2,Taylor Swift,06HL4z0CvFAxyc27GXpf02,3lS1y25WAhcqJDATJK70Mq,album,2022-10-22,2022,day,0.638,0.634,4,...,Anti-Hero,,3,track,spotify:track:02Zkkf2zMkwRGQjZ7T4p8f,https://open.spotify.com/track/02Zkkf2zMkwRGQj...,Midnights (3am Edition),E,major,E major
3,Taylor Swift,06HL4z0CvFAxyc27GXpf02,3lS1y25WAhcqJDATJK70Mq,album,2022-10-22,2022,day,0.659,0.323,9,...,Snow On The Beach (feat. Lana Del Rey),,4,track,spotify:track:6ADDIJxxqzM9LMpm78yzQG,https://open.spotify.com/track/6ADDIJxxqzM9LMp...,Midnights (3am Edition),A,major,A major
4,Taylor Swift,06HL4z0CvFAxyc27GXpf02,3lS1y25WAhcqJDATJK70Mq,album,2022-10-22,2022,day,0.694,0.380,2,...,"You're On Your Own, Kid",,5,track,spotify:track:7gVWKBcfIW93YxNBi3ApIE,https://open.spotify.com/track/7gVWKBcfIW93YxN...,Midnights (3am Edition),D,major,D major
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1260,Taylor Swift,06HL4z0CvFAxyc27GXpf02,1ymIvQpnPQBj1lGlJRqrFQ,album,2006-10-24,2006,day,0.475,0.529,2,...,Mary's Song (Oh My My My) - Instrumental w/ BG...,,10,track,spotify:track:5YluKOG2VGcJMO8XVMpg9h,https://open.spotify.com/track/5YluKOG2VGcJMO8...,Taylor Swift Karaoke,D,major,D major
1261,Taylor Swift,06HL4z0CvFAxyc27GXpf02,1ymIvQpnPQBj1lGlJRqrFQ,album,2006-10-24,2006,day,0.528,0.484,2,...,Our Song - Instrumental w/ BG vocals,,11,track,spotify:track:0PHdWHKV69ZNQyfYLBlVAT,https://open.spotify.com/track/0PHdWHKV69ZNQyf...,Taylor Swift Karaoke,D,major,D major
1262,Taylor Swift,06HL4z0CvFAxyc27GXpf02,1ymIvQpnPQBj1lGlJRqrFQ,album,2006-10-24,2006,day,0.541,0.796,8,...,I'm Only Me When I'm With You - Instrumental w...,,12,track,spotify:track:4Vg8MqpDQFDfKmXdpO1jD3,https://open.spotify.com/track/4Vg8MqpDQFDfKmX...,Taylor Swift Karaoke,G#,major,G# major
1263,Taylor Swift,06HL4z0CvFAxyc27GXpf02,1ymIvQpnPQBj1lGlJRqrFQ,album,2006-10-24,2006,day,0.575,0.279,7,...,Invisible - Instrumental w/ BG vocals,,13,track,spotify:track:7Fg8MxumrT8axFZVzN1MtT,https://open.spotify.com/track/7Fg8MxumrT8axFZ...,Taylor Swift Karaoke,G,major,G major
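Displaying the whole dataframe works, but for a quick look it is often enough to show only the first or last few rows. Here is a minimal sketch using a toy dataframe (the name `toy` is made up for illustration; the values are copied from the first rows shown above):

```python
import pandas as pd

# Toy stand-in for the Spotify table; only three columns, five rows.
toy = pd.DataFrame({
    'track_name': ['Lavender Haze', 'Maroon', 'Anti-Hero',
                   'Snow On The Beach (feat. Lana Del Rey)',
                   "You're On Your Own, Kid"],
    'danceability': [0.735, 0.658, 0.638, 0.659, 0.694],
    'energy': [0.444, 0.378, 0.634, 0.323, 0.380],
})

first_rows = toy.head(3)  # first 3 rows; head() with no argument returns 5
last_rows = toy.tail(2)   # last 2 rows

print(first_rows)
print(last_rows)
```

On the real table, the same idea would be `df.head()`.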


How many columns are in this data table? How many rows (i.e., data points)?

In [11]:
print(len(columns))
print(len(df))
print(len(columns) * len(df)) # total number of values in the table (36 * 1265)

36
1265
45540
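The same three numbers can also be read directly from the dataframe's `shape` and `size` attributes. A small sketch on a toy dataframe (the name `toy` and its contents are made up for illustration):

```python
import pandas as pd

# Toy dataframe with 3 rows and 2 columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

n_rows, n_cols = toy.shape  # (number of rows, number of columns)
print(n_rows, n_cols)
print(toy.size)             # total number of values = rows * columns
```

Applied to the Taylor Swift table, `df.shape` should agree with the row and column counts printed above.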


Now, let's move on to generating some descriptive statistics (statistical quantities that describe the dataset).

In [14]:
# What is the average length of a Taylor Swift song?
np.mean(df['duration_ms'])

230381.6181818182

In [16]:
def convert_time(time_ms):
    # accepts a time in milliseconds and prints it as whole minutes and seconds
    time_sec = time_ms / 1000        # milliseconds -> seconds
    time_min = int(time_sec / 60)    # whole minutes
    time_sec = int(time_sec % 60)    # leftover whole seconds
    print(time_min, 'mins', time_sec, 'seconds')

convert_time(np.mean(df['duration_ms']))

3 mins 50 seconds
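As a variation on the conversion above, the built-in `divmod` gives the quotient and remainder in one step, and returning a string (instead of printing) makes the result reusable. This is only a sketch; `format_duration` is a hypothetical helper name, not part of the original notebook:

```python
def format_duration(time_ms):
    # Drop the fractional milliseconds, then split into minutes and seconds.
    total_sec = int(time_ms // 1000)
    minutes, seconds = divmod(total_sec, 60)
    return f'{minutes} mins {seconds} seconds'

# The mean duration computed earlier in the notebook.
print(format_duration(230381.6181818182))
```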


Often, the purpose of generating descriptive statistics is to see how our dataset compares to another. So, how does our result for the average Taylor Swift song length compare to other modern pop songs? Well, the average in recent years is about... 3 minutes and 50 seconds, too!

https://www.vox.com/2014/8/18/6003271/why-are-songs-3-minutes-long

How about the tempo, then? The comparison figure for the average tempo is 116 bpm (source below), so let's compute Taylor Swift's average.

https://www.washingtonpost.com/news/to-your-health/wp/2015/10/30/the-mathematical-formula-behind-feel-good-songs/

In [20]:
np.round(np.mean(df['tempo']),3)

120.877
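Instead of computing one statistic at a time, pandas can summarize several numeric columns at once with `describe()`. A minimal sketch on a toy dataframe (the name `toy` and the numbers in it are invented for illustration, not taken from the Spotify data):

```python
import pandas as pd

# Toy data: three made-up tracks.
toy = pd.DataFrame({
    'duration_ms': [180000, 240000, 210000],
    'tempo': [96.0, 120.0, 144.0],
})

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = toy.describe()
print(summary)

# Individual statistics can be picked out by row label and column name.
print(summary.loc['mean', 'tempo'])
```

On the real table, something like `df[['duration_ms', 'tempo']].describe()` would give the same kind of summary.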

What are the top 3 keys that Taylor Swift likes to write in?

In [32]:
keys = df['key_mode'].value_counts()
keys
# optional: fraction of all tracks that fall in the top 3 keys
# np.sum(keys[0:3])/np.sum(keys)

G major     224
F major     156
C major     151
E major     138
D major     135
F# major     70
G# major     64
A major      63
A# major     50
C# major     44
C# minor     25
A minor      21
B minor      20
E minor      20
D# major     17
C minor      12
F minor      10
B major       9
G# minor      8
G minor       7
D# minor      7
A# minor      6
F# minor      4
D minor       4
Name: key_mode, dtype: int64
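To see how dominant the top 3 keys are, we can turn the counts into a share of all tracks. The sketch below rebuilds the counts by hand from the output above (so it runs without the CSV file); the variable names `top3` and `share` are just illustrative:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
keys = pd.Series({
    'G major': 224, 'F major': 156, 'C major': 151, 'E major': 138,
    'D major': 135, 'F# major': 70, 'G# major': 64, 'A major': 63,
    'A# major': 50, 'C# major': 44, 'C# minor': 25, 'A minor': 21,
    'B minor': 20, 'E minor': 20, 'D# major': 17, 'C minor': 12,
    'F minor': 10, 'B major': 9, 'G# minor': 8, 'G minor': 7,
    'D# minor': 7, 'A# minor': 6, 'F# minor': 4, 'D minor': 4,
})

top3 = keys.sort_values(ascending=False).head(3)
share = top3.sum() / keys.sum()  # fraction of tracks in the top 3 keys

print(list(top3.index))
print(round(share, 3))
```

With the full dataframe loaded, `df['key_mode'].value_counts(normalize=True).head(3)` gives the per-key fractions directly.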

### Data Cleaning
Often, the raw dataset that we get from the real world is not exactly what we want to analyze. For example, let's list all of Taylor Swift's albums on Spotify:

In [42]:
df['album_name'].unique()

array(['Midnights (3am Edition)', 'Midnights', "Red (Taylor's Version)",
       "Fearless (Taylor's Version)", 'evermore (deluxe version)',
       'evermore',
       'folklore: the long pond studio sessions (from the Disney+ special) [deluxe edition]',
       'folklore (deluxe version)', 'folklore', 'Lover',
       'Taylor Swift Karaoke: reputation', 'reputation',
       'reputation (Big Machine Radio Release Special)',
       'reputation Stadium Tour Surprise Song Playlist',
       'Taylor Swift Karaoke: 1989 (Deluxe)', '1989',
       '1989 (Big Machine Radio Release Special)',
       'Taylor Swift Karaoke: 1989 (Deluxe Edition)',
       'Taylor Swift Karaoke: 1989', '1989 (Deluxe Edition)',
       '1989 (Deluxe)', 'Red (Deluxe Edition)',
       'Red (Big Machine Radio Release Special)', 'Red (Karaoke Version)',
       'Red', 'Taylor Swift Karaoke: Red', 'Speak Now (Japanese Version)',
       'Speak Now World Tour Live', 'Speak Now',
       'Speak Now (Big Machine Radio Release Specia

It seems like lots of albums come in several versions. Can we clean these out of the data?


In [51]:
# create a mask (a boolean series) that marks the rows we want to drop
# a raw string keeps the escaped parenthesis in the pattern intact
mask = (df['album_name'].str.contains(r'\(')) | (df['album_name'].str.contains('Karaoke'))
print(df['album_name'][~mask].unique())

['Midnights' 'evermore' 'folklore' 'Lover' 'reputation'
 'reputation Stadium Tour Surprise Song Playlist' '1989' 'Red'
 'Speak Now World Tour Live' 'Speak Now' 'Fearless'
 'Fearless Platinum Edition' 'Live From Clear Channel Stripped 2008'
 'Taylor Swift']
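The cell above only prints the cleaned album names. To keep working with the cleaned rows, the same mask can be used to build a new dataframe. A minimal sketch on toy data (the album names are taken from the list above; the `tempo` values and the names `toy` and `clean` are made up for illustration). Passing `regex=False` is an alternative to escaping the parenthesis, since it makes `str.contains` do a plain substring match:

```python
import pandas as pd

toy = pd.DataFrame({
    'album_name': ['Midnights (3am Edition)', 'Midnights',
                   'Taylor Swift Karaoke: 1989', '1989'],
    'tempo': [97.0, 97.0, 160.0, 160.0],  # invented values
})

# Plain substring matches, so '(' needs no escaping.
has_paren = toy['album_name'].str.contains('(', regex=False)
is_karaoke = toy['album_name'].str.contains('Karaoke', regex=False)
mask = has_paren | is_karaoke

# ~mask flips True/False, keeping only the rows we did NOT flag.
clean = toy[~mask].reset_index(drop=True)
print(clean)
```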


I am trying to open a club and I need some music to play. What's Taylor Swift's most danceable song?

In [55]:
max_dance = df['danceability'].max()
mask = df['danceability'] == max_dance
df.loc[mask, 'track_name'] # select rows with the mask and the column by name in one step

370    I Think He Knows
Name: track_name, dtype: object
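Another way to find the row with the largest value is `idxmax`, and `nlargest` gives a short ranked list if one song is not enough for a playlist. A sketch on toy data (song names and scores are invented). Note that `idxmax` returns only the first maximum, whereas the mask approach above returns every tied row:

```python
import pandas as pd

toy = pd.DataFrame({
    'track_name': ['Song A', 'Song B', 'Song C'],
    'danceability': [0.61, 0.90, 0.75],
})

# Index label of the row with the highest danceability.
best_row = toy['danceability'].idxmax()
print(toy.loc[best_row, 'track_name'])

# The two most danceable tracks, highest first.
top2 = toy.nlargest(2, 'danceability')
print(top2)
```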

Great job with this! Now, we're going to move on to another example.

## World Crime Index
The world crime index is a way of measuring how dangerous cities are, using a unitless index out of 100. The higher the crime index (and thus the lower the safety index), the more dangerous the city. Let's start, just like we did with Taylor Swift, by reading in the data.

In [62]:
# https://www.kaggle.com/datasets/ahmadjalalmasood123/world-crime-index

df = pd.read_csv('world_crime_index.csv')
df.head(10)

Unnamed: 0,Rank,City,Crime Index,Safety Index
0,1,"Caracas, Venezuela",83.98,16.02
1,2,"Pretoria, South Africa",81.98,18.02
2,3,"Celaya, Mexico",81.8,18.2
3,4,"San Pedro Sula, Honduras",80.87,19.13
4,5,"Port Moresby, Papua New Guinea",80.71,19.29
5,6,"Durban, South Africa",80.6,19.4
6,7,"Johannesburg, South Africa",80.55,19.45
7,8,"Kabul, Afghanistan",79.39,20.61
8,9,"Rio de Janeiro, Brazil",77.93,22.07
9,10,"Natal, Brazil",77.69,22.31


It looks like we have 453 cities in our database, and it's already sorted by most dangerous (descending). Can we find out which country has the safest cities? We'll have to make a new column in our data frame, since "City" includes both the city and the country.

In [59]:
df['City'][0].split(', ')[1]

'Venezuela'

In [61]:
# add in the country data
df['Country'] = [x.split(', ')[1] for x in df['City']] # list comprehension
countries = df['Country']
countries.unique()

array(['Venezuela', 'South Africa', 'Mexico', 'Honduras',
       'Papua New Guinea', 'Afghanistan', 'Brazil', 'Trinidad And Tobago',
       'MD', 'Argentina', 'TN', 'MI', 'Australia', 'Jamaica',
       'United Kingdom', 'NM', 'Peru', 'Ecuador', 'MO', 'El Salvador',
       'Colombia', 'Namibia', 'Puerto Rico', 'Dominican Republic',
       'Syria', 'Angola', 'LA', 'WI', 'CA', 'Nigeria', 'IL', 'France',
       'Philippines', 'Canada', 'OH', 'TX', 'Kazakhstan', 'Maldives',
       'AB', 'Bangladesh', 'PA', 'Italy', 'Malaysia', 'GA', 'Guatemala',
       'Libya', 'AK', 'Zimbabwe', 'Romania', 'Tanzania', 'India', 'Chile',
       'DC', 'Iraq', 'Kenya', 'Belarus', 'IN', 'Belgium', 'FL', 'Sweden',
       'Greece', 'KY', 'Iran', 'Botswana', 'Morocco', 'Costa Rica',
       'Mongolia', 'MN', 'WA', 'NV', 'Uruguay', 'Spain', 'Pakistan',
       'Algeria', 'Vietnam', 'Cambodia', 'Indonesia', 'Ukraine',
       'Portugal', 'Ireland', 'OR', 'Russia', 'AZ', 'Egypt', 'Paraguay',
       'BC', 'NY', 'VA', 'Tur

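Uh oh! It seems like our method split the USA into states (and Canada into provinces, like AB and BC), which we did not want. These entries evidently have three parts (city, state, country), so the second element of the split is the state rather than the country. How can we consider the whole USA all at once? One fix is to take the last element of the split instead of the second.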

In [66]:
# add in the country data, taking the last piece of each entry so states are skipped
df['Country'] = [x.split(', ')[-1] for x in df['City']] # list comprehension
countries = df['Country']
countries = countries.unique()
countries

array(['Venezuela', 'South Africa', 'Mexico', 'Honduras',
       'Papua New Guinea', 'Afghanistan', 'Brazil', 'Trinidad And Tobago',
       'United States', 'Argentina', 'Australia', 'Jamaica',
       'United Kingdom', 'Peru', 'Ecuador', 'El Salvador', 'Colombia',
       'Namibia', 'Puerto Rico', 'Dominican Republic', 'Syria', 'Angola',
       'Nigeria', 'France', 'Philippines', 'Canada', 'Kazakhstan',
       'Maldives', 'Bangladesh', 'Italy', 'Malaysia', 'Guatemala',
       'Libya', 'Zimbabwe', 'Romania', 'Tanzania', 'India', 'Chile',
       'Iraq', 'Kenya', 'Belarus', 'Belgium', 'Sweden', 'Greece', 'Iran',
       'Botswana', 'Morocco', 'Costa Rica', 'Mongolia', 'Uruguay',
       'Spain', 'Pakistan', 'Algeria', 'Vietnam', 'Cambodia', 'Indonesia',
       'Ukraine', 'Portugal', 'Ireland', 'Russia', 'Egypt', 'Paraguay',
       'Turkey', 'Montenegro', 'Panama', 'Ethiopia', 'Tunisia', 'Ghana',
       'North Macedonia', 'New Zealand', 'Lebanon', 'Thailand',
       'Bosnia And Herzegovina', 
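The same fix can be written without a list comprehension by using the pandas `.str` accessor. A minimal sketch on a toy series (the first two entries appear in the table above; the three-part entry 'Baltimore, MD, United States' is an assumed example of the format, not copied from the dataset):

```python
import pandas as pd

cities = pd.Series([
    'Caracas, Venezuela',
    'Pretoria, South Africa',
    'Baltimore, MD, United States',  # assumed three-part entry
])

# Split each string on ', ' and keep the last piece of each resulting list.
country = cities.str.split(', ').str[-1]
print(country.tolist())

# For comparison, index 1 picks up the state for three-part entries.
second_piece = cities.str.split(', ').str[1]
print(second_piece.tolist())
```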

In [68]:
country_avg_safety = np.zeros(len(countries))
for i in range(len(countries)):
    # build a mask that is True only for the cities of the current country
    # (e.g., when we're on Cuba, it selects only the Cuban cities)
    mask = df['Country'] == countries[i]
    # compute the mean of the masked data
    country_avg_safety[i] = df[mask]['Safety Index'].mean()
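The loop above can also be expressed with `groupby`, which splits the rows by country and averages each group in one call. This is a sketch of an alternative, not the notebook's original method. The toy rows below use safety values from the first rows of the table above (Pretoria, Durban, Rio de Janeiro, Natal, Caracas), so the averages here cover only those few cities; counting the cities per country also shows how many data points each average rests on:

```python
import pandas as pd

toy = pd.DataFrame({
    'Country': ['South Africa', 'South Africa', 'Brazil', 'Brazil', 'Venezuela'],
    'Safety Index': [18.02, 19.40, 22.07, 22.31, 16.02],
})

# Mean safety index and number of cities per country, safest first.
avg = (toy.groupby('Country')['Safety Index']
          .agg(['mean', 'count'])
          .sort_values('mean', ascending=False))
print(avg)
```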

In [70]:
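# collect the per-country averages in a new table and sort from safest to least safe
country_df = pd.DataFrame({'Country': countries, 'Average Safety': country_avg_safety})
country_df.sort_values('Average Safety', ascending = False)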

Unnamed: 0,Country,Average Safety
117,Qatar,86.040000
115,United Arab Emirates,85.175000
116,Taiwan,84.950000
114,Oman,79.460000
113,Bahrain,78.940000
...,...,...
1,South Africa,22.663333
5,Afghanistan,20.610000
4,Papua New Guinea,19.290000
3,Honduras,19.130000


Based on this data, Qatar has the highest average safety index, so it seems to have the safest streets!

Limitations:
* Criminal statutes differ across countries.
* Not all cities are included.
* Soft crimes like harassment would not be considered.

Note: as a data scientist, it's always important to consider the limitations of your dataset.