# Problem

This Billboard Top 100 dataset records the weekly rank of each song from the moment it enters the Billboard Top 100 through the subsequent 75 weeks. Given the `billboard.csv` file, write a program that turns it into a tidy pandas DataFrame, `df_new`.

**In this notebook, the instructions for each step and the results of each step are shown. Your task is to fill in the code to complete each step.**

## Step by step solution

First, import the csv file as a dataframe.

In [2]:
import pandas as pd

df = pd.read_csv("billboard.csv", encoding="mac_latin2")

df.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,


Next, "melt" the dataframe from wide format to long format in a new dataframe called `df_new`, so that the `x*.week` columns are converted into values in a column called "week" and the rank in each week is in a column called "rank".

You may want to refer to the `pandas.melt` documentation here: https://pandas.pydata.org/docs/reference/api/pandas.melt.html

In [9]:
# Your code here...
id_vars = ['year', 'artist.inverted', 'track', 'time', 'genre', 'date.entered', 'date.peaked']
df_new = df.melt(id_vars = id_vars, var_name = 'week', value_name = 'rank')
df_new.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,x1st.week,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,x1st.week,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,x1st.week,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,x1st.week,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,x1st.week,57.0
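
To see what `melt` does in isolation, here is a minimal sketch on a made-up two-song table. The frame `wide`, its values, and the name `long` are hypothetical and not part of the Billboard data; only the `melt` call mirrors the step above.

```python
import pandas as pd

# Hypothetical wide-format table: one row per song, one column per week.
wide = pd.DataFrame({
    "track": ["Song A", "Song B"],
    "x1st.week": [78, 15],
    "x2nd.week": [63.0, None],  # Song B dropped off the chart after week 1
})

# Every column not listed in id_vars is stacked into (week, rank) pairs.
long = wide.melt(id_vars=["track"], var_name="week", value_name="rank")
print(long)
```

Each song now appears once per week column, so two songs and two week columns give four rows.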


Next, we need to reformat the `week` column so that it contains only the week number rather than the full `x*.week` label. Searching for "pandas extract number from string" is a good starting point.

Make sure to convert the resulting values to the type "int".

In [12]:
# Your code here...
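# A raw string avoids an invalid-escape warning for \d; expand=False returns a Series
df_new['week'] = df_new.week.str.extract(r'(\d+)', expand=False).astype('int')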
df_new.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,1,78.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,1,15.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1,71.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,1,41.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,1,57.0
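
The extraction step can be checked on a handful of labels before running it on the full column. This is a small sketch; the Series `labels` and the name `week_numbers` are illustrative only.

```python
import pandas as pd

# A few column labels in the same style as the x*.week columns.
labels = pd.Series(["x1st.week", "x2nd.week", "x10th.week", "x76th.week"])

# r"(\d+)" captures the first run of digits; expand=False yields a Series
# instead of a one-column DataFrame.
week_numbers = labels.str.extract(r"(\d+)", expand=False).astype(int)
print(week_numbers.tolist())
```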


In [13]:
df_new.dtypes

year                 int64
artist.inverted     object
track               object
time                object
genre               object
date.entered        object
date.peaked         object
week                 int32
rank               float64
dtype: object

Next, make sure the song `rank` is also an integer. It may be useful to consult the pandas documentation regarding a specific integer type that can contain "NA" values: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

In [15]:
# Your code here...
df_new['rank'] = df_new['rank'].astype('Int64')
df_new.dtypes

year                int64
artist.inverted    object
track              object
time               object
genre              object
date.entered       object
date.peaked        object
week                int32
rank                Int64
dtype: object
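
The reason for the capitalized `Int64` can be seen on a tiny example. The sketch below uses made-up values; it shows that NumPy's plain `int64` cannot hold missing values, while the nullable `Int64` extension type can.

```python
import pandas as pd

# Hypothetical ranks with one missing week.
rank_float = pd.Series([78.0, 63.0, float("nan")])

# Plain NumPy int64 has no representation for missing values.
try:
    rank_float.astype("int64")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable extension type keeps the gap as pd.NA.
rank_int = rank_float.astype("Int64")
print(plain_cast_failed)
print(rank_int)
```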

Next, remove any rows that contain `NA` values from the dataframe.

In [16]:
df_new.shape

(24092, 9)

In [17]:
# Your code here...
df_new.dropna(inplace=True)

In [18]:
df_new.shape

(5307, 9)
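
Note that `dropna()` with no arguments drops a row if any column is missing. If only missing ranks should count, one possible variant is to pass `subset`. The sketch below uses a hypothetical frame `df_demo` and assumes that only the `rank` column matters for this step.

```python
import pandas as pd

# Hypothetical long-format data: Song B has no rank in week 2.
df_demo = pd.DataFrame({
    "track": ["Song A", "Song A", "Song B", "Song B"],
    "week": [1, 2, 1, 2],
    "rank": pd.array([78, 63, 15, None], dtype="Int64"),
})

# Keep only the weeks in which the song actually charted.
charted = df_demo.dropna(subset=["rank"])
print(charted.shape)
```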

Next, create a `date` column, which is `date.entered` plus (`week` - 1) weeks, so that week 1 falls on the entry date itself. The `pd.to_datetime()` and `pd.to_timedelta()` functions may be useful.

In [19]:
# Your code here...
# Week 1 is the entry date itself, so the offset is (week - 1) weeks
df_new['date'] = pd.to_datetime(df_new['date.entered']) + pd.to_timedelta(df_new['week'] - 1, unit='W')
df_new

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,week,rank,date
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,1,78,2000-09-23
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,1,15,2000-02-12
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1,71,1999-10-23
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,1,41,2000-08-12
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,1,57,2000-08-05
5,2000,Janet,Doesn't Really Matter,4:17,Rock,2000-06-17,2000-08-26,1,59,2000-06-17
6,2000,Destiny's Child,Say My Name,4:31,Rock,1999-12-25,2000-03-18,1,83,1999-12-25
7,2000,"Iglesias, Enrique",Be With You,3:36,Latin,2000-04-01,2000-06-24,1,63,2000-04-01
8,2000,Sisqo,Incomplete,3:52,Rock,2000-06-24,2000-08-12,1,77,2000-06-24
9,2000,Lonestar,Amazed,4:25,Country,1999-06-05,2000-03-04,1,81,1999-06-05
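
The date arithmetic can be verified on a single song. The sketch below uses a small hypothetical frame `demo` with an entry date of 2000-02-26, and the week offset is computed the same way as above.

```python
import pandas as pd

# One song that entered the chart on 2000-02-26, observed for three weeks.
demo = pd.DataFrame({
    "date.entered": ["2000-02-26", "2000-02-26", "2000-02-26"],
    "week": [1, 2, 3],
})

# Parse the entry date, then add (week - 1) whole weeks.
demo["date"] = pd.to_datetime(demo["date.entered"]) + pd.to_timedelta(demo["week"] - 1, unit="W")
print(demo["date"].dt.strftime("%Y-%m-%d").tolist())
```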


Next, we get rid of the extra columns. The columns we want to keep are `["year", "artist.inverted", "track", "time", "genre", "week", "rank", "date"]`.

In [20]:
# Your code here...
df_new = df_new[["year", "artist.inverted", "track", "time", "genre", "week", "rank", "date"]]
df_new

Unnamed: 0,year,artist.inverted,track,time,genre,week,rank,date
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,1,78,2000-09-23
1,2000,Santana,"Maria, Maria",4:18,Rock,1,15,2000-02-12
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1,71,1999-10-23
3,2000,Madonna,Music,3:45,Rock,1,41,2000-08-12
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,1,57,2000-08-05
5,2000,Janet,Doesn't Really Matter,4:17,Rock,1,59,2000-06-17
6,2000,Destiny's Child,Say My Name,4:31,Rock,1,83,1999-12-25
7,2000,"Iglesias, Enrique",Be With You,3:36,Latin,1,63,2000-04-01
8,2000,Sisqo,Incomplete,3:52,Rock,1,77,2000-06-24
9,2000,Lonestar,Amazed,4:25,Country,1,81,1999-06-05


Next, sort the dataframe by `["year","artist.inverted","track","week","rank"]`.

In [21]:
# Your code here...
df_new = df_new.sort_values(by=["year","artist.inverted","track","week","rank"])
df_new

Unnamed: 0,year,artist.inverted,track,time,genre,week,rank,date
246,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,1,87,2000-02-26
563,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2,82,2000-03-04
880,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,3,72,2000-03-11
1197,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,4,77,2000-03-18
1514,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,5,87,2000-03-25
1831,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,6,94,2000-04-01
2148,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,7,99,2000-04-08
287,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,1,91,2000-09-02
604,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,2,87,2000-09-09
921,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,3,92,2000-09-16


The final step in the first part is to reset the index.

In [22]:
# Your code here...
df_new.reset_index(drop=True, inplace=True)
df_new

Unnamed: 0,year,artist.inverted,track,time,genre,week,rank,date
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,1,87,2000-02-26
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2,82,2000-03-04
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,3,72,2000-03-11
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,4,77,2000-03-18
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,5,87,2000-03-25
5,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,6,94,2000-04-01
6,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,7,99,2000-04-08
7,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,1,91,2000-09-02
8,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,2,87,2000-09-09
9,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,3,92,2000-09-16


Now, on to the second part. Create a new dataframe `songs` that only contains information relevant to the song itself: columns `["year", "artist.inverted", "track", "time", "genre"]`.

In [27]:
# Your code here...
songs = df_new[["year", "artist.inverted", "track", "time", "genre"]]
songs

Unnamed: 0,year,artist.inverted,track,time,genre
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap
5,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap
6,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap
7,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B
8,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B
9,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B


Next, get rid of the duplicate records. They arise because each song has one row for every week it spent on the Billboard chart.

In [28]:
songs.shape

(5307, 5)

In [29]:
# Your code here...
songs = songs.drop_duplicates()  # inplace=True produces a warning

In [30]:
songs.shape

(317, 5)

Now, reset the index, and create a new column `song_id` which is equal to the index values.

In [31]:
# Your code here...
# Assigning the result (rather than using inplace=True) gives a fresh frame,
# so adding a column does not trigger SettingWithCopyWarning
songs = songs.reset_index(drop=True)
songs['song_id'] = songs.index
songs.head()

Unnamed: 0,year,artist.inverted,track,time,genre,song_id
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,0
1,2000,2Ge+her,The Hardest Part Of Breaking Up (Is Getting Ba...,3:15,R&B,1
2,2000,3 Doors Down,Kryptonite,3:53,Rock,2
3,2000,3 Doors Down,Loser,4:24,Rock,3
4,2000,504 Boyz,Wobble Wobble,3:35,Rap,4
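
The de-duplication and ID assignment can be sketched on a toy table. The frame `songs_demo` and its values are hypothetical; the chained calls are one way to do both steps without modifying a slice in place.

```python
import pandas as pd

# Hypothetical song rows, with Song A repeated once per charted week.
songs_demo = pd.DataFrame({
    "year": [2000, 2000, 2000],
    "artist.inverted": ["Artist X", "Artist X", "Artist Y"],
    "track": ["Song A", "Song A", "Song B"],
})

# Drop repeats, renumber the rows from 0, and use the new index as the key.
songs_demo = songs_demo.drop_duplicates().reset_index(drop=True)
songs_demo["song_id"] = songs_demo.index
print(songs_demo)
```

A quick check such as `songs_demo["song_id"].is_unique` confirms that each song received exactly one ID.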


Finally, on to the third part, where we create a dataframe `ranks` which has columns `song_id`, `date`, and `rank`.

First, merge the `songs` dataframe with `df_new`. (Related question: does pandas have a function called `merge`?)

In [33]:
# Your code here...
ranks = df_new.merge(songs)
ranks.head()

Unnamed: 0,year,artist.inverted,track,time,genre,week,rank,date,song_id
0,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,1,87,2000-02-26,0
1,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,2,82,2000-03-04,0
2,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,3,72,2000-03-11,0
3,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,4,77,2000-03-18,0
4,2000,2 Pac,Baby Don't Cry (Keep Ya Head Up II),4:22,Rap,5,87,2000-03-25,0
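
When `merge` is called without `on`, pandas joins on every column name the two frames share. A more explicit variant, sketched below on hypothetical frames `weekly` and `songs_demo`, names the key columns and asks pandas to verify that each key matches at most one song. The choice of key columns here is an assumption for illustration.

```python
import pandas as pd

# Hypothetical weekly ranks (many rows per song).
weekly = pd.DataFrame({
    "artist.inverted": ["Artist X", "Artist X", "Artist Y"],
    "track": ["Song A", "Song A", "Song B"],
    "week": [1, 2, 1],
    "rank": [87, 82, 91],
})

# Hypothetical song table (one row per song).
songs_demo = pd.DataFrame({
    "artist.inverted": ["Artist X", "Artist Y"],
    "track": ["Song A", "Song B"],
    "song_id": [0, 1],
})

# validate="many_to_one" raises MergeError if a key is duplicated in songs_demo.
merged = weekly.merge(
    songs_demo,
    on=["artist.inverted", "track"],
    how="left",
    validate="many_to_one",
)
print(merged)
```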


Finally, get rid of the columns we don't need, and reset the index.

In [34]:
# Your code here...
# Keep only the key, the date, and the rank, then renumber the rows from 0
ranks = ranks[['song_id', 'date', 'rank']].reset_index(drop=True)
ranks.head()

Unnamed: 0,song_id,date,rank
0,0,2000-02-26,87
1,0,2000-03-04,82
2,0,2000-03-11,72
3,0,2000-03-18,77
4,0,2000-03-25,87


And that is how we create tidy dataframes in Python!