import pandas
import numpy as np
import re
import string

# Introduction

On September 28, 1984, the hit single "Let's Go Crazy" by Prince and The Revolution climbed to the top of the U.S. Billboard charts ("Chart History: Prince"), becoming not only one of Prince's five number-one singles, but also the first song ever to top Billboard despite being marked as "explicit" (Ross). Since then, songs featuring explicit content have become increasingly popular, with recent studies of Billboard chart data revealing that the proportion of Billboard number-one singles with explicit content has surged from below 10% in 1984 to over 60% in 2017 (Bannister). This in turn raises the question: **given the recent rise in commercial success of explicit music, has critical reception of explicit music similarly improved over the years?**

To answer this question, I began by scraping data from Metacritic (https://www.metacritic.com/music), a website that aggregates critic reviews of music and other forms of entertainment, and Genius (https://genius.com/), a website centered on song lyrics and music news. Specifically, I focused on Metacritic's 50 to 100 most critically acclaimed albums of each year from 2010 to 2020. Using lyrical data from Genius, I computed the percentage and quantity of explicit content on each of these albums, and found that in general...

<div style="border-bottom: 4px solid #AAA; padding-bottom: 6px; font-size: 16px; font-weight: bold;"></div>

# Data Description

As noted above, the dataset used in this study was constructed manually by scraping web data from Metacritic and Genius. The dataset is described in further detail below:

### What are the observations and the attributes of this dataset?

Each observation in this dataset represents an album, and together the observations form a collection of Metacritic's top-ranked albums for each year from 2010 to 2020. (*Note:* Metacritic aggregates each album's critic reviews into a score from 0 to 100, and this aggregate score is then used to rank all albums released each year. All rankings used in this study were accurate as of May 5, 2020.) For each album, six attributes are stored in the dataset:

1. the name of the album artist; 
2. the *metascore*, or the average critic score (on an integer scale from 0-100) that the album received on Metacritic; 
3. the title of the album; 
4. the average user score (on a decimal scale from 0.0-10.0) that the album received from Metacritic users;
5. the date on which the album was released; and
6. a list of all available album lyrics on Genius, separated by track.

### Why was this dataset created?

This dataset was created solely for the purpose of carrying out this study, and therefore is meant to facilitate analysis of any potential relationships between the amount of profanity in an album and the album’s critical reception.

### Who funded the creation of this dataset?

No significant funding was needed to create this dataset, and no grant is associated with it.

### What processes might have influenced what data was observed and recorded and what was not?

First, Metacritic itself excludes certain releases from its year-end album rankings, noting, “Only albums with seven or more reviews are eligible. EPs, live albums, box sets, re-issues, and compilations are also excluded” ("Best Music and Albums for 2020"). Therefore, all projects deemed “ineligible” by Metacritic are absent from this study's dataset as well.

In addition, album lyrics data might not be collected for every track of every album, given that Genius does not include a lyrics page for every possible track of every album in the dataset. (For example, Genius does not provide lyrics for the last six tracks of *Grassed In* by Australian rock band Blank Realm.)

### What preprocessing was done, and how did the data come to be in the form that you are using?

Since this dataset was constructed manually for this study, no already-compiled dataset had to be preprocessed. However, the process of building the dataset resembled a typical preprocessing stage, and is therefore explained below:

First, album data for Metacritic’s top-ranked albums (for each year from 2010 to 2020) was collected by copying and pasting directly from Metacritic web pages into a Microsoft Excel spreadsheet. (An example of one such web page can be found [here](https://www.metacritic.com/browse/albums/score/metascore/year/filtered?view=condensed&sort=desc).) The data was arranged in a single column, with every set of five consecutive rows (beginning with rows 0, 1, 2, 3, and 4) representing a single album. Within each set, the rows corresponded to the following album data:

- Row 0: the name of the album artist
- Row 1: the metascore, or the average critic score (on a scale from 0-100) that the album received on Metacritic
- Row 2: the title of the album
- Row 3: the average user score (on a scale from 0.0-10.0) that the album received from Metacritic users
- Row 4: the day, month, and year on which the album was released

With this row structure in mind, the spreadsheet was converted to a CSV file (comma-separated values file), which was then converted to a Pandas DataFrame object for further data processing. From there, by iterating through the rows of the DataFrame object, a dictionary representing each album could be constructed for each set of five rows. This culminated in a list of dictionaries containing all basic album information.
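The grouping step described above can be sketched roughly as follows. This is a minimal illustration rather than the code actually used: it works on a plain Python list of cell values instead of a Pandas DataFrame to stay dependency-free, and the names `FIELDS` and `rows_to_albums`, as well as the sample values, are hypothetical.

```python
# Field order follows the five-row layout listed above (rows 0 through 4).
FIELDS = ["artist", "metascore", "title", "user_score", "release_date"]

def rows_to_albums(rows):
    """Group a flat list of cell values into one dict per album (five rows each)."""
    albums = []
    # Only step through complete groups of five; a trailing partial group is ignored.
    usable = len(rows) - len(rows) % len(FIELDS)
    for start in range(0, usable, len(FIELDS)):
        group = rows[start:start + len(FIELDS)]
        albums.append(dict(zip(FIELDS, group)))
    return albums

# Illustrative placeholder values only, not taken from the dataset.
sample_rows = [
    "Artist A", "90", "Album A", "8.5", "6-Mar-20",
    "Artist B", "85", "Album B", "7.9", "20-Mar-20",
]
albums = rows_to_albums(sample_rows)
print(albums[0]["title"])  # Album A
```

Ignoring an incomplete trailing group is a design choice of this sketch; raising an error instead would make a malformed spreadsheet easier to spot.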

Next, to obtain album lyrics data, a couple of Python functions were written to iterate through this list of dictionaries. For each dictionary (or equivalently, for each album), a Genius URL was dynamically constructed from the artist name and album title (e.g., http://genius.com/albums/The-weeknd/After-hours). Each of these URLs leads to an overview of the album in question, including a tracklist and links to the lyrics pages for all available tracks. Lyrics were then scraped from each of these lyrics pages and stored in the appropriate dictionary, so that each dictionary had the following format:

`{"artist": (name of album artist),
"metascore": (average critic score that album received on Metacritic),
"release_date": (day on which album was released in the format d-mmm-yy, e.g. 6-Mar-20),
"title": (title of album),
"user_score": (average user score that album received on Metacritic),
"lyrics": (all available album lyrics, concatenated as a single string)}`

At this point, however, preliminary analysis of the dataset revealed that approximately 27.6% of all album entries had no scraped lyrics. Some of these albums were instrumental records and therefore contained no lyrics, and others were simply not covered by the Genius lyrics database. However, it also became apparent that punctuation in an album's title or its artist's name was causing invalid Genius URLs to be generated during the initial lyrics scraping process described above. Therefore, additional lyrics scraping had to be performed.
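To make the punctuation problem concrete, here is a hedged sketch of how the initial URL construction may have worked. The rule (replace each punctuation mark with a space, turn each space into a hyphen, then capitalize) is inferred from the example URLs quoted in this section, and the function name `initial_genius_url` is hypothetical.

```python
import string

def initial_genius_url(artist, title):
    """Build an album URL by turning punctuation into spaces, then spaces into hyphens."""
    def slug(text):
        # Replace every punctuation character with a space.
        spaced = "".join(" " if ch in string.punctuation else ch for ch in text)
        # Each space becomes a hyphen; capitalize() uppercases the first
        # character and lowercases the rest.
        return spaced.replace(" ", "-").capitalize()
    return "http://genius.com/albums/" + slug(artist) + "/" + slug(title)

print(initial_genius_url("The Weeknd", "After Hours"))
# http://genius.com/albums/The-weeknd/After-hours
print(initial_genius_url("Kendrick Lamar", "DAMN."))
# http://genius.com/albums/Kendrick-lamar/Damn-
```

The second call shows the failure mode: the trailing period becomes a stray hyphen, so the generated URL does not match the real album page.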

More specifically, two new algorithms for lyrics scraping were developed. The first of these (which will be referred to as the "normal alternative" algorithm) accounts for ampersands in album titles or artist names by replacing each ampersand with the word "and." (For example, for the album *Look Now* by Elvis Costello & The Imposters, this algorithm generates "https://genius.com/albums/Elvis-costello-and-the-imposters/Look-now" rather than "http://genius.com/albums/Elvis-costello---the-imposters/Look-now" as before.) Additionally, this algorithm removes all other punctuation marks rather than replacing them with spaces. (This was designed to handle edge cases such as *DAMN.* by Kendrick Lamar. Under the initial algorithm, the period at the end of *DAMN.* was replaced by a space, which in turn led to http://genius.com/albums/Kendrick-lamar/Damn- being generated instead of http://genius.com/albums/Kendrick-lamar/Damn as desired.)
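A minimal sketch of the "normal alternative" rule, under the assumption that it otherwise follows the same hyphenate-and-capitalize pattern as the example URLs above. The function name `alternative_genius_url` is hypothetical, and collapsing repeated whitespace is an addition of this sketch, not something the description confirms.

```python
import string

def alternative_genius_url(artist, title):
    """Build an album URL with "&" spelled out and all other punctuation removed."""
    def slug(text):
        # Spell out ampersands first so they survive punctuation removal.
        text = text.replace("&", "and")
        # Drop all remaining punctuation instead of replacing it with spaces.
        text = "".join(ch for ch in text if ch not in string.punctuation)
        # Assumption: collapse runs of whitespace, then join words with hyphens.
        return "-".join(text.split()).capitalize()
    return "https://genius.com/albums/" + slug(artist) + "/" + slug(title)

print(alternative_genius_url("Elvis Costello & The Imposters", "Look Now"))
# https://genius.com/albums/Elvis-costello-and-the-imposters/Look-now
print(alternative_genius_url("Kendrick Lamar", "DAMN."))
# https://genius.com/albums/Kendrick-lamar/Damn
```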

In addition to this updated algorithm, a second algorithm was created, which simply used manually collected URLs to access lyrics data. In other words, for each album that still had no stored lyrics after the initial and "normal alternative" algorithms were executed, the correct URL (if one existed) was found manually through web searches and recorded. This process ensured that as much lyrics data as possible was scraped.
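The overall fallback order (automatic URL rules first, then a manually recorded URL) could be organized along these lines. This is only a sketch of the control flow: `find_lyrics`, `manual_urls`, and `fetch_lyrics` are hypothetical names, and the stand-in fetcher below does no real scraping.

```python
def find_lyrics(album, url_builders, manual_urls, fetch_lyrics):
    """Try each automatic URL builder in order, then a manually recorded URL."""
    candidates = [build(album["artist"], album["title"]) for build in url_builders]
    # Manually recorded URLs are keyed by (artist, title) in this sketch.
    manual = manual_urls.get((album["artist"], album["title"]))
    if manual:
        candidates.append(manual)
    for url in candidates:
        # fetch_lyrics is assumed to return "" when a page is missing or empty.
        lyrics = fetch_lyrics(url)
        if lyrics:
            return lyrics
    return ""

# Stand-in data and fetcher, used only to demonstrate the control flow.
fake_pages = {"https://example.com/manual-url": "some lyrics"}
album = {"artist": "Artist A", "title": "Album A!"}
result = find_lyrics(
    album,
    url_builders=[lambda artist, title: "https://example.com/bad-url"],
    manual_urls={("Artist A", "Album A!"): "https://example.com/manual-url"},
    fetch_lyrics=lambda url: fake_pages.get(url, ""),
)
print(result)  # some lyrics
```

Passing the fetcher in as a parameter keeps the fallback logic testable without network access.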

Finally, following this second wave of lyrics scraping, the updated list of dictionaries was converted into a Pandas DataFrame object for further analysis.

### If people are involved, were they aware of this data collection and if so, what purpose did they expect the data to be used for?

No people were involved in the data collection process (in the sense of being surveyed for personal information), as all data was simply scraped from web pages already available online.

### Where can your raw source data be found, if applicable?

The raw, unprocessed data scraped from Metacritic is available [here](https://github.com/genghisshyy/INFO_2950_FinalProject/blob/master/data/metacritic_data.csv), while the final processed dataset can be found [here](https://github.com/genghisshyy/INFO_2950_FinalProject/blob/master/data/albums.csv).

<div style="border-bottom: 4px solid #AAA; padding-bottom: 6px; font-size: 16px; font-weight: bold;"></div>

# Data Analysis

<div style="border-bottom: 4px solid #AAA; padding-bottom: 6px; font-size: 16px; font-weight: bold;"></div>

# Acknowledgements

All works cited are presented below:

Bannister, Mark. "The Billboard Hot 100: Exploring Six Decades of Number One Singles." GitHub, Apr. 2017, [Repository Link](https://github.com/mspbannister/dand-p4-billboard/blob/master/Billboard_analysis__100417_.md#the-billboard-hot-100-exploring-six-decades-of-number-one-singles). Accessed 16 May 2020.
     
"Best Music and Albums for 2020." Metacritic, [Page Link](https://www.metacritic.com/browse/albums/score/metascore/year/filtered?view=condensed&sort=desc). Accessed 16 May 2020.

"Chart History: Prince." Billboard, [Page Link](https://www.billboard.com/music/prince/chart-history/HSI/song/333454). Accessed 16 May 2020.

Ross, Eleanor. "Parental Advisory: How Songs with Explicit Content Came to Dominate the Charts." Newsweek, 13 Apr. 2017, [Page Link](https://www.newsweek.com/songs-explicit-lyrics-popular-increase-billboard-spotify-583551). Accessed 16 May 2020.