# Billboard Hot 100 - Exploratory Analysis

## Introduction

Our dataset consists of the 600 songs which made it onto the [Billboard Hot 100](https://www.billboard.com/charts/hot-100) in the past 12 months. We also have relevant metadata for those songs, obtained through [Genius](https://genius.com/), a website that allows its users to provide annotations and interpretations of song lyrics.

The Billboard Hot 100 is a music industry standard chart that lists the most commercially successful singles in the United States. The rankings are based on a combination of record sales (both physical and digital), radio play, and online streaming.

I chose to gather the metadata from Genius, as opposed to other sources, because:
1. It is publicly available and has a relatively simple and well-documented API to work with
2. It has a standardized and comprehensive listing of the information I was most interested in: songwriter and production credits and song lyrics

In [1]:
import json
from datetime import datetime as dt
import reprlib

In [2]:
# Read in data (the context manager closes the file once it has been parsed)
filename = 'billboard-hot100.json'
with open(filename, 'r') as f:
    chartData = json.load(f)
reprlib.repr(chartData[0])

"{'billboardArtist': 'Post Malone ...ing 21 Savage', 'billboardTitle': 'Rockstar', 'chartHist': [{'chartName': 'hot-100', 'datesOnChart': ['2017-11-11', '2017-11-04', '2017-10-28', '2017-10-21', '2017-10-14', '2017-10-07'], 'peakRank': 1, 'rankHist': [1, 1, 1, 2, 2, 2], ...}], 'featuredArtists': [{'genius_id': 430404, 'genius_url': 'https://geni...sts/21-savage', 'name': '21 Savage'}], ...}"

## Producer Ranking

The initial inspiration behind this project was to (quantitatively) identify the most commercially successful producers and song-writers of today. Since we do not have raw sales and streaming data in our dataset, our notion of commercial success will have to be defined through chart performance. One of the simplest definitions of commercial success considers only the volume of chart placements, that is, the number of different songs that made it onto the charts.

We will define this problem more generically as such:


$$
\text{producer_score}(\text{producer}) := \sum_{\text{song} \ \in \ \text{producer.songs}} \text{weight}(\text{producer}, \text{song}) \cdot \text{song_score(song)}
$$

We can think of the score of a producer as the weighted sum of the scores of the songs they contributed to. The $\text{song_score(song)}$ is a number that summarizes a song's commercial performance, and the weights $\text{weight}(\text{producer}, \text{song})$ represent the relative impact/importance of a producer's contribution to a particular song. We're initially interested in defining the score to be the number of chart songs that a producer contributed to. This can be accomplished by defining:

$$\begin{align}
\text{song_score}(\text{song}) & := 1 \\
\text{weight}(\text{producer},\text{song}) & := 1
\end{align}
$$

This means all songs have the same score (irrespective of their actual chart performance) and all contributors are given the same weight. We're going to have to adjust this if we want the $\text{producer_score}$ to reflect a more accurate and comprehensive sense of commercial success.
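As a minimal sketch of the general definition above, the weighted sum can be written with the weight and the song score passed in as separate functions. The toy records and the names `producer_scores`, `toy_songs`, `unit_weight`, and `unit_score` are assumptions made for illustration; only the `producers` and `name` fields mirror the dataset.

```python
from collections import defaultdict


def producer_scores(songs, weight, song_score):
    """Return {producer name: sum of weight(producer, song) * song_score(song)}."""
    scores = defaultdict(float)
    for song in songs:
        for producer in song.get("producers", []):
            name = producer["name"]
            scores[name] += weight(name, song) * song_score(song)
    return dict(scores)


# Hypothetical toy data, not taken from the real dataset.
toy_songs = [
    {"title": "Song A", "producers": [{"name": "P1"}, {"name": "P2"}]},
    {"title": "Song B", "producers": [{"name": "P1"}]},
    {"title": "Song C", "producers": []},
]


def unit_weight(producer, song):
    # Every contributor gets full credit for the song.
    return 1


def unit_score(song):
    # Every charting song counts the same.
    return 1


toy_counts = producer_scores(toy_songs, unit_weight, unit_score)
print(toy_counts)  # {'P1': 2.0, 'P2': 1.0}
```

With both functions fixed at 1, the score reduces to a plain count of chart placements per producer.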

In [3]:
# Returns the top n personnel (producers/writers) ranked by their score (as defined above).
# `personale` is the key of the credit list to rank, e.g. "producers"
def top(n, chartData, personale, songScore):
    staffScore = {}

    for song in chartData:
        score = songScore(song)
        # Songs without the requested credit list contribute nothing
        for staff in song.get(personale, []):
            staffScore[staff["name"]] = staffScore.get(staff["name"], 0) + score

    return sorted(staffScore.items(), key=lambda x: x[1], reverse=True)[:n]

top10prod = top(10, chartData, "producers", lambda x: 1)

print("Top 10 Producers (Ranked by the number of Hot 100 placements in 2017)")
print("-------------------------------------------------------------------")
for producer, count in top10prod:
    print(f"{producer} \t\t {count}")

Top 10 Producers (Ranked by the number of Hot 100 placements in 2017)
-------------------------------------------------------------------
Benny Blanco 		 18
Metro Boomin 		 16
Frank Dukes 		 14
Southside 		 13
Dann Huff 		 13
The Weeknd 		 12
CuBeatz 		 11
Max Martin 		 10
Murda Beatz 		 10
Doc McKinney 		 9


A lot of the names on the list don't come as a surprise (as they are the most sought-after producers in the Hip-Hop and Pop music world). I was somewhat surprised to find The Weeknd and CuBeatz on this list, given that:

1. The Weeknd is primarily known for his work as a vocalist and performer, not as a producer. That being said, it is not uncommon for the performing artist/vocalist to get production credits for overseeing the development and musical direction of the song. For the purposes of our analysis, we want to focus more on the "primary producers", those most responsible for defining and executing the sonic direction of the song. Ironically, under this metric 'Doc McKinney', one of The Weeknd's longtime in-house producers, is actually ranked lower than The Weeknd.
2. I had not heard of CuBeatz prior to this, but after looking at their credits through Genius, I found that they had contributed to some of the most popular Hip-Hop songs of the past few years (mainly through their collaboration with other producers in the list, like Metro Boomin, Southside, and Murda Beatz).

Both of these points speak to the fact that many of the biggest Hip-Hop and Pop songs tend to have a number of different producers and song-writers contributing to them. We need to account for that by (ideally) weighting the song score by the relative significance of each producer's contribution. Unfortunately, such information can't be obtained programmatically through the Genius API, so we cannot assign weights to the producers non-uniformly. Thus, the most naive way around this is to define the weight based on the number of contributors to the song:

$$\text{weight(producer, song)} := \text{weight(song)} := \frac{1}{| \ \text{song.producers} \ |}$$

Note that this will effectively penalize those producers that are more collaborative.
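To make the collaboration penalty concrete, here is a small sketch comparing the credit a producer receives for a solo production with the credit for a four-producer song. The function name `collaboration_weight` and both toy songs are hypothetical and exist only for this illustration.

```python
def collaboration_weight(song):
    """Return 1 / (number of producers), or 0.0 when no producers are listed."""
    producers = song.get("producers", [])
    return 1.0 / len(producers) if producers else 0.0


# Hypothetical songs, not taken from the real dataset.
solo_song = {"producers": [{"name": "Solo"}]}
team_song = {"producers": [{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]}

solo_credit = collaboration_weight(solo_song)
team_credit = collaboration_weight(team_song)
print(solo_credit, team_credit)  # 1.0 0.25
```

Under this weighting, a producer on four-person teams needs four chart placements to match a single solo placement.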

In [4]:
# Each song is worth 1, split evenly among its credited producers
def songScore(song):
    if "producers" not in song or len(song["producers"]) == 0:
        return 0
    else:
        return 1.0/len(song["producers"])

top(10, chartData, "producers", songScore)

[('Dann Huff', 9.916666666666666),
 ('Benny Blanco', 9.45),
 ('Metro Boomin', 8.116666666666665),
 ('No I.D.', 7.5),
 ('Calvin Harris', 7.0),
 ('Greg Kurstin', 6.75),
 ('Southside', 6.166666666666666),
 ('The Chainsmokers', 6.0),
 ('Busbee', 5.5),
 ('Frank Dukes', 5.333333333333334)]

Our rankings have changed pretty drastically after this adjustment: certain producers who weren't in the previous list are positioned very high under the new scoring. The two entries that stand out are:

1. No I.D., a respected Hip-Hop producer and label executive, responsible for producing Jay-Z's 2017 album '4:44'
2. Calvin Harris, a popular dance and pop music producer and DJ, who released 'Funk Wav Bounces Vol. 1' this year (a collaborative album with some of the most popular artists of today)

What stands out about them is that neither producer collaborated with other producers on their respective projects (which didn't perform as well on the charts as the songs of some of the lower-ranked producers in the list). Their inclusion (and positioning) in the list indicates that our metric:

* Excessively penalizes those producers who tend to collaborate with others (such as Frank Dukes and CuBeatz)
* Doesn't take into account the chart performance of the respective songs

The second point is highlighted by No I.D.'s inclusion in the list, as none of the songs he produced broke into the top 20 or stayed on the charts for more than 2 weeks. We need to change our $\text{song_score}$ to actually take chart performance into account. One simple way is to sum the scaled rank across all the weeks the song was on the Billboard Hot 100:

$$\text{song_score}(\text{song}) := \sum_{\text{week}} \frac{101 - \text{rank(song, week)}}{10}$$
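As a quick worked example of the formula above, the sketch below applies it to the six weekly ranks shown in the sample record at the top of the notebook (`[1, 1, 1, 2, 2, 2]`). The function name `weekly_rank_score` is hypothetical, and the per-producer weight is deliberately left out so that only the song score is shown.

```python
def weekly_rank_score(rank_hist):
    """Sum (101 - rank) / 10 over every week the song spent on the chart."""
    return sum((101 - rank) / 10 for rank in rank_hist)


# Weekly ranks from the sample record: three weeks at #1, three weeks at #2.
sample_rank_hist = [1, 1, 1, 2, 2, 2]
sample_score = weekly_rank_score(sample_rank_hist)
print(round(sample_score, 1))  # 59.7
```

Three weeks at number 1 contribute 10.0 each and three weeks at number 2 contribute 9.9 each, so the score grows with every additional week on the chart.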

In [5]:
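def score(song):
    producers = song.get("producers", [])
    if len(producers) == 0:
        return 0

    runningScore = 0
    for chart in song["chartHist"]:
        # Weekly contribution: 101/10 minus the rank scaled by 1/5
        # (this is the scaling used for the results below)
        runningScore += ((101.0/10.0) * chart["weeks"]) - sum(rank/5.0 for rank in chart["rankHist"])

    # Split the song's score evenly among its producers
    return runningScore/len(producers)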

top(10, chartData, "producers", score)

[('Shampoo Press & Curl', 479.5),
 ('Mike WiLL Made-It', 459.7583333333334),
 ('Metro Boomin', 411.33666666666653),
 ('The Chainsmokers', 410.6000000000001),
 ('Steve Mac', 378.9666666666666),
 ('Daft Punk', 331.59999999999997),
 ('Benny Blanco', 259.0416666666666),
 ('Zach Crowell', 254.89999999999998),
 ('Ed Sheeran', 195.6333333333333),
 ('DJ Swivel', 168.59999999999997)]

Our ranks have changed dramatically after this adjustment. A few things stand out:
* Shampoo Press & Curl (Bruno Mars' in-house song-writing and production group) are number one on our list, despite having only a **single** Hot 100 placement. Granted, that placement was Bruno Mars' hit single 'That's What I Like', which performed really well on the charts (it's been on the charts for 43 weeks).
* Daft Punk, the famed French production/DJ duo, contributed to two hit singles from The Weeknd ('Starboy' and 'I Feel It Coming') which peaked at number 1 and 4, respectively. 

Based on this alone, it seems that our metric is influenced too much by a particular song's sustained presence on the charts and discounts the number of placements (which is more indicative of the sort of 'commercial consistency' that we'd like our metric to capture).

We'll downweight the influence of any particular song's sustained presence on the charts while still accounting for its performance, as follows:
    
$$\text{song_score(song)} = 1 + 0.5 \sum_{i \in \{1, 10, 20, 50\}} I(\text{song.peakRank} \le i)$$

where the indicator function $I(\text{song.peakRank} \le i)$ is 1 if $\text{song.peakRank} \le i$ and 0 otherwise. This new metric essentially assigns each song that charts a base score of 1 and adds 0.5 for each milestone the song reaches (top 50, top 20, top 10, and number one). From this we can derive certain equivalences (e.g. a number one hit scores 3, the same aggregate score as 2 songs that peak inside the top 50 but outside the top 20).
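The formula above can be checked by hand on a few peak ranks. The sketch below uses the four thresholds from the formula; the names `MILESTONES`, `milestone_score`, and `example_scores` are hypothetical and chosen only for this illustration.

```python
# Thresholds taken from the formula: number one, top 10, top 20, top 50.
MILESTONES = (1, 10, 20, 50)


def milestone_score(peak_rank, milestones=MILESTONES, bonus=0.5):
    """Base score of 1, plus `bonus` for every milestone the peak rank reaches."""
    return 1 + bonus * sum(1 for m in milestones if peak_rank <= m)


example_scores = {peak: milestone_score(peak) for peak in (1, 15, 50, 80)}
print(example_scores)  # {1: 3.0, 15: 2.0, 50: 1.5, 80: 1.0}
```

A song that only ever peaks at number 80 still earns the base score of 1, while a number one hit collects all four bonuses.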

In [6]:
# Given a value x and a closed interval [low, high], this helper function returns 1 if
# x lies in that interval and 0 otherwise
def indicator(x, low, high):
    return 1 if low <= x <= high else 0


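def score(song):
    if len(song.get("producers", [])) == 0:
        return 0

    runningScore = 0
    for chart in song["chartHist"]:
        # Base score of 1 plus 0.5 for each milestone reached; the results below
        # use the number one, top 10, and top 50 thresholds
        runningScore += 1 + (0.5 * sum(indicator(chart["peakRank"], 1, x) for x in [1, 10, 50]))

    return runningScore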

top(10, chartData, "producers", score)

[('Benny Blanco', 25.5),
 ('Metro Boomin', 24.5),
 ('Frank Dukes', 18.5),
 ('Southside', 17.0),
 ('Dann Huff', 17.0),
 ('The Weeknd', 15.5),
 ('Max Martin', 15.0),
 ('CuBeatz', 15.0),
 ('Mike WiLL Made-It', 13.5),
 ('Murda Beatz', 12.5)]

While we could continue to tweak our metric, it looks good enough for our demonstrative purposes. We still have the issue, however, of The Weeknd and CuBeatz being placed on the list because of our inability to appropriately weight production contributions.