<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Data Manipulation, EDA, and Reporting Results

_Authors: Joseph Nelson (DC), Sam Stack (DC)_

---

> **This lab is intentionally open-ended, and you're encouraged to answer your own questions about the dataset!**


### What makes a song a hit?

On next week's episode of the 'Are You Entertained?' podcast, we're going to be analyzing the latest generation's guilty pleasure: the music of the '00s.

Our data scientists have pored over Billboard chart data to analyze what made a hit soar to the top of the charts and how long it stayed there. Tune in next week for an awesome exploration of music and data as we continue to address an omnipresent question in the industry: why do we like what we like?

**Provide (at least) a markdown cell explaining your key learnings about top hits: what are they, what common themes are there, is there a trend among artists (type of music)?**

---

### Minimum Requirements

**At a minimum, you must:**

- Use Pandas to read in your data
- Rename column names where appropriate
- Describe your data: check the value counts and descriptive statistics
- Make use of groupby statements
- Use Boolean filtering (masks) to subset rows
- Assess the validity of your data (missing data, distributions?)

**You should strive to:**

- Produce a blog-post ready description of your lab
- State your assumptions about the data
- Describe limitations
- Consider how a stakeholder (radio, record label, fan) could act on your findings
- Include visualizations

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import math

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Billboard data CSV:
billboard_csv = './datasets/billboard.csv'

# We need to use encoding='latin-1' to deal with non-ASCII characters.
df = pd.read_csv(billboard_csv, encoding='latin-1')

In [2]:
bbdata = df
bbdata.head(10)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,
5,2000,Janet,Doesn't Really Matter,4:17,Rock,2000-06-17,2000-08-26,59,52.0,43.0,...,,,,,,,,,,
6,2000,Destiny's Child,Say My Name,4:31,Rock,1999-12-25,2000-03-18,83,83.0,44.0,...,,,,,,,,,,
7,2000,"Iglesias, Enrique",Be With You,3:36,Latin,2000-04-01,2000-06-24,63,45.0,34.0,...,,,,,,,,,,
8,2000,Sisqo,Incomplete,3:52,Rock,2000-06-24,2000-08-12,77,66.0,61.0,...,,,,,,,,,,
9,2000,Lonestar,Amazed,4:25,Country,1999-06-05,2000-03-04,81,54.0,44.0,...,,,,,,,,,,
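
The minimum requirements mention value counts and Boolean filtering. A minimal sketch of both, assuming a small hypothetical `toy` DataFrame that copies a few values from the rows above (it is not the real dataset):

```python
import pandas as pd

# Hypothetical toy rows that mirror a few Billboard columns; not the real dataset.
toy = pd.DataFrame({
    "artist.inverted": ["Santana", "Madonna", "Lonestar", "Master P"],
    "genre": ["Rock", "Rock", "Country", "Rap"],
    "x1st.week": [15, 41, 81, 98],
})

# Count how many tracks carry each genre label.
genre_counts = toy["genre"].value_counts()
print(genre_counts)

# Boolean filtering: keep only tracks that debuted inside the top 50.
debuted_top_50 = toy[toy["x1st.week"] <= 50]
print(debuted_top_50)
```

On the full data, the same two calls applied to `bbdata` would show how unevenly the genres are represented and which tracks entered the chart at a high position.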


In [3]:
bbdata.describe()

Unnamed: 0,year,x1st.week,x2nd.week,x3rd.week,x4th.week,x5th.week,x6th.week,x7th.week,x8th.week,x9th.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
count,317.0,317.0,312.0,307.0,300.0,292.0,280.0,269.0,260.0,253.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,2000.0,79.958991,71.173077,65.045603,59.763333,56.339041,52.360714,49.219331,47.119231,46.343874,...,,,,,,,,,,
std,0.0,14.686865,18.200443,20.752302,22.324619,23.780022,24.473273,25.654279,26.370782,27.136419,...,,,,,,,,,,
min,2000.0,15.0,8.0,6.0,5.0,2.0,1.0,1.0,1.0,1.0,...,,,,,,,,,,
25%,2000.0,74.0,63.0,53.0,44.75,38.75,33.75,30.0,27.0,26.0,...,,,,,,,,,,
50%,2000.0,81.0,73.0,66.0,61.0,57.0,51.5,47.0,45.5,42.0,...,,,,,,,,,,
75%,2000.0,91.0,84.0,79.0,76.0,73.25,72.25,67.0,67.0,67.0,...,,,,,,,,,,
max,2000.0,100.0,100.0,100.0,100.0,100.0,99.0,100.0,99.0,100.0,...,,,,,,,,,,
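
The shrinking `count` row above (317, 312, 307, ...) suggests that the weekly columns contain more `NaN` values as the weeks go on, presumably because tracks drop off the chart. One way to quantify this is sketched below on a hypothetical `toy` frame (an assumption for illustration, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: NaN marks a week with no recorded rank for the track.
toy = pd.DataFrame({
    "x1st.week": [98, 99, 98],
    "x2nd.week": [98.0, 98.0, np.nan],
    "x3rd.week": [98.0, np.nan, np.nan],
})

# Number of missing ranks in each week column.
missing_per_week = toy.isnull().sum()
# Number of recorded ranks for each track (row-wise).
weeks_charted = toy.notnull().sum(axis=1)

print(missing_per_week)
print(weeks_charted)
```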


In [7]:
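# Work on a copy so the original DataFrame is not modified.
bbdata_enhanced = bbdata.copy()
bbdata_enhanced["Weeks on BB"] = 0
# Float placeholders: the loop below stores float values in these columns.
bbdata_enhanced["Average Rank"] = 0.0
bbdata_enhanced["Max"] = 0.0
bbdata_enhanced["Min"] = 0.0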

In [13]:

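# Columns 7-82 hold the weekly ranks (x1st.week ... x76th.week);
# columns 83-86 are the four summary columns added above.
for i in range(len(bbdata_enhanced)):
    bbranks = bbdata_enhanced.iloc[i, 7:83]
    count = 0
    rank_sum = 0
    min_n = 100
    max_n = 0
    for j in bbranks:
        # NaN means no rank was recorded for the track in that week.
        if not math.isnan(j):
            count += 1
            rank_sum += j
            max_n = max(max_n, j)
            min_n = min(min_n, j)
    if count > 0:
        bbdata_enhanced.iloc[i, 83] = count             # weeks with a recorded rank
        bbdata_enhanced.iloc[i, 84] = rank_sum / count  # mean rank over those weeks
        bbdata_enhanced.iloc[i, 85] = max_n             # largest rank number (worst position)
        bbdata_enhanced.iloc[i, 86] = min_n             # smallest rank number (peak position)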

bbdata_enhanced.tail(10)

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week,Weeks on BB,Average Rank,Max,Min
307,2000,"Larrieux, Amel",Get Up,4:02,R&B,2000-03-04,2000-03-11,100,97.0,97.0,...,,,,,,,3,98.0,100.0,97.0
308,2000,"Braxton, Toni",Spanish Guitar,4:24,Rock,2000-12-02,2000-12-02,98,98.0,98.0,...,,,,,,,3,98.0,98.0,98.0
309,2000,Tuesday,I Know,4:06,Rock,2000-12-30,2000-12-30,98,98.0,,...,,,,,,,2,98.0,98.0,98.0
310,2000,LL Cool J,Imagine That,4:00,Rap,2000-08-12,2000-08-19,99,98.0,,...,,,,,,,2,98.5,99.0,98.0
311,2000,Master P,Souljas,3:33,Rap,2000-11-18,2000-11-18,98,,,...,,,,,,,1,98.0,98.0,98.0
312,2000,Ghostface Killah,Cherchez LaGhost,3:04,R&B,2000-08-05,2000-08-05,98,,,...,,,,,,,1,98.0,98.0,98.0
313,2000,"Smith, Will",Freakin' It,3:58,Rap,2000-02-12,2000-02-12,99,99.0,99.0,...,,,,,,,4,99.0,99.0,99.0
314,2000,Zombie Nation,Kernkraft 400,3:30,Rock,2000-09-02,2000-09-02,99,99.0,,...,,,,,,,2,99.0,99.0,99.0
315,2000,"Eastsidaz, The",Got Beef,3:58,Rap,2000-07-01,2000-07-01,99,99.0,,...,,,,,,,2,99.0,99.0,99.0
316,2000,Fragma,Toca's Miracle,3:22,R&B,2000-10-28,2000-10-28,99,,,...,,,,,,,1,99.0,99.0,99.0
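
The same four summary columns can also be computed without an explicit loop. The sketch below is an alternative, not the approach used above. It assumes a hypothetical `toy` frame with three weekly columns whose values are copied from rows 307, 310, and 311, and it selects the weekly columns by name with `filter(like=".week")` rather than by position:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame shaped like the weekly rank columns; not the full dataset.
toy = pd.DataFrame({
    "track": ["Get Up", "Imagine That", "Souljas"],
    "x1st.week": [100, 99, 98],
    "x2nd.week": [97.0, 98.0, np.nan],
    "x3rd.week": [97.0, np.nan, np.nan],
})

# Select the weekly rank columns by name instead of by position.
ranks = toy.filter(like=".week")

# Row-wise reductions skip NaN by default, which mirrors the isnan check in the loop.
toy["Weeks on BB"] = ranks.count(axis=1)
toy["Average Rank"] = ranks.mean(axis=1)
toy["Max"] = ranks.max(axis=1)
toy["Min"] = ranks.min(axis=1)

print(toy)
```

Selecting columns by name avoids hard-coded positions such as `7:83`, which break silently if a column is added or removed earlier in the notebook.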


In [14]:
# Note: 'end' holds the date the track peaked, not the date it left the chart.
bbdata_enhanced.rename(columns={'date.entered': 'start', 'date.peaked': 'end'}, inplace=True)

bbdata_enhanced.head()

Unnamed: 0,year,artist.inverted,track,time,genre,start,end,x1st.week,x2nd.week,x3rd.week,...,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week,Weeks on BB,Average Rank,Max,Min
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,28,14.821429,78.0,1.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,26,10.5,48.0,1.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,33,17.363636,71.0,1.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,24,13.458333,44.0,1.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,21,19.952381,57.0,1.0


In [18]:
# Boolean mask: tracks that charted for at least 10 weeks.
ten_weeks_or_more = bbdata_enhanced[bbdata_enhanced['Weeks on BB'] >= 10]
len(ten_weeks_or_more)

242
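
To turn that count into a share, divide by the total number of tracks. The short calculation below uses the two counts shown in the outputs above (242 tracks with ten or more weeks, 317 tracks overall); the variable names are illustrative:

```python
# Counts taken from the notebook outputs above.
tracks_total = 317
tracks_ten_weeks_or_more = 242

share = tracks_ten_weeks_or_more / tracks_total
print(f"{share:.1%}")  # 76.3%
```

In pandas, the mean of the Boolean mask itself, `(bbdata_enhanced['Weeks on BB'] >= 10).mean()`, should give the same proportion without creating the intermediate DataFrame.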

In [20]:
# Mean number of weeks on the chart for each genre.
mean_weeks_by_genre = bbdata_enhanced.groupby('genre')['Weeks on BB'].mean()
mean_weeks_by_genre

genre
Country        16.216216
Electronica    18.000000
Gospel         20.000000
Jazz            5.000000
Latin          19.222222
Pop            15.222222
R&B            11.347826
Rap            14.431034
Reggae         15.000000
Rock           18.883212
Name: Weeks on BB, dtype: float64

In [21]:
# Mean of the per-track average rank for each genre (lower is better).
mean_rank_by_genre = bbdata_enhanced.groupby('genre')['Average Rank'].mean()
mean_rank_by_genre

genre
Country        66.690659
Electronica    65.170833
Gospel         67.750000
Jazz           51.800000
Latin          52.133637
Pop            64.055007
R&B            74.677187
Rap            67.915707
Reggae         72.400000
Rock           52.941287
Name: Average Rank, dtype: float64
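
Group means are easier to trust when the group sizes are visible next to them. Several genres above have whole-number means for weeks on the chart, which may mean they rest on only a handful of tracks. The sketch below reports the mean and the count together. It uses a hypothetical `toy` frame with illustrative values, not the real dataset:

```python
import pandas as pd

# Hypothetical toy rows; genres and values are illustrative only.
toy = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rock", "Jazz"],
    "Weeks on BB": [28, 26, 33, 5],
    "Average Rank": [14.8, 10.5, 17.4, 51.8],
})

# Report the group size next to each mean so small groups are easy to spot.
summary = toy.groupby("genre")[["Weeks on BB", "Average Rank"]].agg(["mean", "count"])
print(summary)
```

A genre with a count of 1 or 2 says more about those specific tracks than about the genre as a whole, which is worth stating as a limitation.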

In [None]:
# Roughly 76% of the songs that made the top 100 (242 of 317) stayed there for at least 10 weeks.
# On average, Jazz songs spent the least time on the chart (5 weeks) and Gospel songs spent the most (20 weeks).
# Jazz, Latin, and Rock have the best (lowest) average ranks, at about 52 to 53; R&B (74.7) and Reggae (72.4) have the worst.