# Introduction

Our dataset consists of all the songs which landed on the [Billboard Hot 100](billboard.com/charts/hot-100) in the past 12 months and related metadata obtained through [Genius](https://genius.com/), a website that allows its users to provide annotations and interpretation of song lyrics.

The Billboard Hot 100 is a music industry standard chart that lists the most commercially successful singles in North America. The rankings are based on a combination of record sales (both physical and digital), radio play, and online streaming. 

** In this notebook, I'll be exploring the various ways of visualizing the Billboard Hot 100 data with the graphing library [Plotly](https://plot.ly/python/) and do some tangential exploratory analysis.**

In [1]:
import json
from datetime import datetime as dt
import reprlib
import math

import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode(connected=True)

In [2]:
filename = 'billboard-hot100.json'
chartData = json.load(open(filename, 'r'))
reprlib.repr(chartData[0])

"{'billboardArtist': 'Post Malone ...ing 21 Savage', 'billboardTitle': 'Rockstar', 'chartHist': [{'chartName': 'hot-100', 'datesOnChart': ['2017-11-11', '2017-11-04', '2017-10-28', '2017-10-21', '2017-10-14', '2017-10-07'], 'peakRank': 1, 'rankHist': [1, 1, 1, 2, 2, 2], ...}], 'featuredArtists': [{'genius_id': 430404, 'genius_url': 'https://geni...sts/21-savage', 'name': '21 Savage'}], ...}"

In [3]:
graphData = []
for song in chartData[:20]:
    for chart in song["chartHist"]:
        x = list(map(lambda x: dt.strptime(x,'%Y-%m-%d'),chart.get("datesOnChart", [])))
        y = chart.get("rankHist",[])
        alp = (101 - chart["peakRank"])/100
        title = "{0} - {1}".format(song["billboardArtist"], song["billboardTitle"])
        graphData.append(go.Scatter(x=x, y=y, mode = 'lines+markers', opacity = alp, text = title, name = ''))
        
layout = dict(
    title = "Billboard Top 20 Rank History",
    xaxis = dict(title = "Date", range = ['2016-11-10','2017-11-10']),
    yaxis = dict(title = "Ranking", range = [100,1]),
    showlegend = False,
    hovermode = 'closest'
)

fig = dict(data=graphData, layout=layout)
py.iplot(fig)

In the above graph, we've plotted the rank history of the current top 20 songs on the Billboard Hot 100 chart. As we'd expect most of the songs have had a relatively steady upward trend in their chart position over time. One thing to noting is that a majority of the current top 20 didn't enter the top 20 until a couple of months ago (70% of them entered the top 20 within in the past 2 months). 

I am most curious to see if we could identify certain marketing actions that explain for certain spurious chart behavior.

* One interesting jump I noticed is that of the Luis Fonsi and Daddy Yankee song *'Despaciato'* which jumped from #48 on April 29 to #9 in the following week. What I found was that, the song was originally released on January 26 but was later remixed by Justin Bieber on April 17 (right before the jump) which clearly contributed to the success that it had on the charts.  This is clearly a great example of where remixing a song can extend and improve chart performance (especially if the remixer is a pop star).

* Another interesting case to note is the Ed Sheeran song *'Perfect'* which entered (and quickly exited) the charts on Mar 25, a few weeks after Ed Sheeran had released his album '$\div$'. It wasn't released as a single until later in September 26, 2017, after which it quickly re-entered and climed the charts (it is at #10 as of November 11). This illustrates how the promotion of a song as a radio single can significantly impact its commercial performance.

The Ed Sheeran point makes me curious: **How long does it takes songs to reach their peak position on the charts (relative to their release date)**

In [4]:
# Given the chart information on a song, this function returns the date in
# which the song first reached its peak rank
def findPeakDate(chart):
    minIdx = chart["rankHist"].index(chart["peakRank"])
    return chart["datesOnChart"][minIdx]

weeksTilPeak = []
weeksTilEnter = []

# Iterate through all the songs in our dataset and compute the number of
# days until they (1) enter the charts and (2) reach their peak position
for song in chartData:
    for chart in song["chartHist"]:
        if song["releaseDate"] not in ['', None]:
            peakDate = dt.strptime(findPeakDate(chart),'%Y-%m-%d')
            enterDate = dt.strptime(chart["datesOnChart"][0],'%Y-%m-%d')
            releaseDate = dt.strptime(song["releaseDate"],'%Y-%m-%d')
            
            peakDelta = peakDate - releaseDate
            enterDelta = enterDate - releaseDate
            
            
            if enterDelta.days > 0:
                weeksTilPeak.append(min(53, math.ceil(peakDelta.days/7)))
                weeksTilEnter.append(min(53, math.ceil(enterDelta.days/7)))

freqPeak = go.Histogram(x=weeksTilPeak, histnorm='probability',xbins=dict(start=0,end=52,size=4))
layout = dict(
    title = "Histogram of # of weeks until song reaches peak rank",
    xaxis = dict(title = "# of Weeks"), yaxis = dict(title = "Relative Frequency", range = [0,1])
)
fig = go.Figure(data=[freqPeak], layout=layout)
py.iplot(fig)

cumFreqPeak = go.Histogram(x=weeksTilPeak, histnorm='probability', cumulative=dict(enabled=True),
                           xbins=dict(start=0,end=52,size=4))
layout["title"] = "Cumulative " + layout["title"]
fig = go.Figure(data=[cumFreqPeak], layout=layout)

py.iplot(fig)

We have plotted the distribution of the number a weeks (from release) that it takes for a song to reaches its peak position on the Billboard Hot 100 chart. We have augmented the last bin to include songs which took longer than a year (52 weeks) to peak. Here are some things to note:
* Nearly half (45%) of the songs that charted in the past year took between 1 to 2 months to reach their peak position
* Roughly 80% of the songs took less than 6 months to reach their peak position 

In [5]:
freqEnter = go.Histogram(x=weeksTilEnter, histnorm='probability',
                         xbins=dict(start=0,end=52,size=4))
layout = dict(
    title = "Histogram of # of weeks until song enters Billboard Hot 100",
    xaxis = dict(title = "# of Weeks"), yaxis = dict(title = "Relative Frequency", range = [0,1]))
fig = go.Figure(data=[freqEnter], layout=layout)
py.iplot(fig)

cumFreqEnter = go.Histogram(x=weeksTilEnter,histnorm='probability',cumulative=dict(enabled=True),
                            xbins=dict(start=0,end=52,size=4))
layout["title"] = "Cumulative " + layout["title"]
fig = go.Figure(data=[cumFreqEnter], layout=layout)
py.iplot(fig)

In the above graphs, we plotted the distribution of the number a weeks (from release) that it takes for a song to enter the Billboard Hot 100 chart. Here are some things I observed:

* Nearly 40% of the songs entered the charts within 2 months of release
* Roughly 75% of songs took less than 6 months to enter the Hot 100 charts

**Conclusion**: If a record label or artist hopes for a song to achieve mass commercial success, it should be heavily promoted prior to or soon after its release. Otherwise, due to the frantic pace of new releases in the music industry - its likely that it would be overshadowed or forgotten. It would be interesting to compare these distributions over time to see how the changes within the music industry affect the pace in which songs chart and reach a critical mass. 