## Visualizing Genre Trends

After imputing genre using Multinomial Naive Bayes, I had almost all the information I needed to visualize Pitchfork's genre trends. I love end-of-year album rankings, but my dataset didn't include them. So I went looking, and it turns out that albumoftheyear.org, an aggregator of most music publications, had everything in an easily scrapable format.

### Step 1: Find list information from Album of the Year

The steps for scraping albumoftheyear.org for Pitchfork's lists are very similar to my initial web scrape, so I won't go into much detail here.

In [1]:
#Import necessary modules
import urllib.request as ur #Handles URLs
from bs4 import BeautifulSoup #Parses webpage content
import requests
from lxml import html
import numpy as np
import pandas as pd
import time

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
#Get the base URL used for Pitchfork Lists
base_url = "https://www.albumoftheyear.org/publication/1-pitchfork/lists/"
hdr = {'User-Agent': 'Mozilla/5.0'}
full_url_req = ur.Request(base_url, headers=hdr)
full_url_response = ur.urlopen(full_url_req)
soup = BeautifulSoup(full_url_response, "lxml") #Name the parser explicitly to avoid the bs4 warning
sections = soup.find("div", class_="section")
a_tags = sections.find_all("a")
list_urls = []
for a_tag in a_tags:
    #Grab the link in the "href" of the a tag
    current_href=a_tag.get('href')
    list_urls.append('https://www.albumoftheyear.org' + current_href)
    
#Remove duplicate URLs
list_urls = list(set(list_urls))

In [3]:
#Define function to clean any html tags
import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

Quick note about the following code: Album of the Year uses one, hyphen-delimited HTML tag for each Album/Artist combination. Splitting each string by the delimiter worked for most cases, but two album titles also contained hyphens. I worked around this issue with some if-else logic which I then printed to double-check everything looked alright.
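
As an alternative to the if-else workaround, here is a minimal sketch that splits only on the first delimiter. It assumes the artist name itself never contains " - ", which may not hold for every entry. The helper name `split_artist_title` is hypothetical and not part of the original notebook.

```python
def split_artist_title(text, delimiter=" - "):
    """Split 'Artist - Title' on the first delimiter only.

    Assumes the artist name never contains the delimiter; any later
    occurrences are kept as part of the title.
    """
    artist, _, title = text.partition(delimiter)
    return artist, title

# The two titles flagged by the notebook's print statement
print(split_artist_title("Múm - Yesterday Was Dramatic - Today Is OK"))
print(split_artist_title("A.A.L. (Against All Logic) - 2012 - 2017"))
```

Because `str.partition` stops at the first match, a title with any number of extra delimiters stays intact without a special case.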

In [4]:
#Initiate lists for the outcome variables we want
rank = []
title = []
artist = []
list_year = []
review_url = []

#Iterate through each year-end list and append pertinent information to list variables
for current_list_url in list_urls:
    #Get year from the end of each URL
    current_list_year = current_list_url[-5:-1]
    list_url = current_list_url
    list_url_req = ur.Request(list_url, headers=hdr)
    list_url_response = ur.urlopen(list_url_req)
    soup = BeautifulSoup(list_url_response, "lxml") #Name the parser explicitly to avoid the bs4 warning
    row_items = soup.find_all("div", class_="albumListRow")
    list_items = soup.find_all("h2", class_="albumListTitle")
    blurbs = soup.find_all("div", class_="albumListBlurbLink")
    for item in list_items:
        rank_html = item.find("span", class_="albumListRank")
        artist_title_html = item.find("a")
        artist_title = cleanhtml(str(artist_title_html)).split(' - ')
        #Two album titles contain the delimiter, so rejoin the 2nd and 3rd pieces
        if len(artist_title) > 2:
            print("More dashes for %s" %artist_title)
            title.append(artist_title[1] + " - " + artist_title[2])
        else:
            title.append(artist_title[1])
        artist.append(artist_title[0])
        #Escape the period so only a literal ". " is stripped from the rank text
        rank.append(re.sub(r'\. ', '', cleanhtml(str(rank_html))))
        list_year.append(current_list_year)
    #Get the URL for matching with the review data
    for row in row_items:
        blurb = row.find("div", class_="albumListBlurbLink")
        try:
            url = blurb.find("a").get("href").replace('www.','')
        except AttributeError: #No blurb or no link found for this row
            url = "None"
        review_url.append(url)



More dashes for ['Múm', 'Yesterday Was Dramatic', 'Today Is OK']
More dashes for ['A.A.L. (Against All Logic)', '2012', '2017']


In [5]:
#Create dataframe of artist, album title, rank, and year
lists_df = pd.DataFrame({'Artist': artist,
                              'Album': title,
                              'List_Rank': list(map(int, rank)),
                              'List_Year': list_year,
                              'URL': review_url})

#Sort data by list year and rank (both descending)
lists_df = lists_df.sort_values(by=['List_Year','List_Rank'], ascending=False)

### Step 2: Join Album of the Year data to review data

Next, I joined the list data to my review data (with imputed genres) using stripped-down versions of the URL from each dataframe.

In [6]:
#Read review_df from pickle
review_df = pd.read_pickle('p4k_cleansed_imputed_genre.pkl')

#Create uniform URLs
import re
def replaceEndSlash(url):
    return(re.sub(r"/$", "", url))

#regex=False treats each pattern literally (otherwise the "." in 'www.' could match any character)
review_df['URL'] = review_df['URL'].str.replace('https://', '', regex=False)
review_df['URL'] = review_df['URL'].str.replace('http://', '', regex=False)
review_df['URL'] = review_df['URL'].str.replace('www.', '', regex=False)

lists_df['URL'] = lists_df['URL'].str.replace('https://', '', regex=False)
lists_df['URL'] = lists_df['URL'].str.replace('http://', '', regex=False)
lists_df['URL'] = lists_df['URL'].str.replace('www.', '', regex=False)

review_df['URL'] = review_df['URL'].apply(replaceEndSlash)
lists_df['URL'] = lists_df['URL'].apply(replaceEndSlash)

lists_df = lists_df.drop(columns=['Artist','Album'])

#Join lists_df to review_df, removing duplicates
full_df = pd.merge(review_df, lists_df,  how='left', left_on='URL', right_on = 'URL').drop_duplicates()
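
The URL cleanup can also be collapsed into a single helper. The sketch below is one possible version, not the notebook's own code: the name `normalize_url` and the example URL are hypothetical, and unlike the chained replacements it only strips the scheme and `www.` at the start of the string.

```python
import re

def normalize_url(url):
    """Strip the scheme, a leading 'www.', and one trailing slash."""
    url = re.sub(r"^https?://", "", url)  # drop http:// or https://
    url = re.sub(r"^www\.", "", url)      # drop a leading www.
    return re.sub(r"/$", "", url)         # drop a single trailing slash

# Hypothetical URL, for illustration only
print(normalize_url("https://www.pitchfork.com/reviews/albums/example/"))
```

A helper like this could be applied to both dataframes with `.apply(normalize_url)`, so the join key is built the same way on each side. The placeholder string "None" used for missing links passes through unchanged.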

### Step 3: Transform data from wide to long format for quick calculations

After some miscellaneous data cleansing, I uploaded a slim version of the data, containing only the structured fields, to GitHub. I then aggregated that data into a long-format summary, which made it easier to work with in Dash. I was primarily interested in viewing two measures by year:

* Proportion of Best New Music albums that fall into each genre
* Album of the Year points

Album of the Year uses the following breakdown:

*1st Place*: 10 points  
*2nd Place*: 8 points  
*3rd Place*: 6 points  
*4-10 Place*: 5 points  
*11-25 Place*: 3 points  
*Other Place*: 1 point
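
As a worked example of this breakdown, the sketch below scores a handful of ranks and sums them. The function name `rank_to_points` and the example ranks are made up for illustration. It assumes ranks are positive integers and gives 1 point to every rank past 25, whereas the notebook's own `assignPoints` function returns NaN for ranks past 50.

```python
def rank_to_points(rank):
    """Map a year-end list rank to points using the breakdown above."""
    if rank == 1:
        return 10
    if rank == 2:
        return 8
    if rank == 3:
        return 6
    if rank <= 10:
        return 5
    if rank <= 25:
        return 3
    return 1

# Hypothetical ranks for one genre in one year (made up for illustration)
example_ranks = [1, 7, 30]
total = sum(rank_to_points(r) for r in example_ranks)
print(total)  # 10 + 5 + 1 = 16
```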

I filtered out data before 2003, the year Pitchfork introduced the Best New Music label.

In [7]:
#Read Data
full_df = pd.read_csv('https://raw.githubusercontent.com/cameron-marsden/pitchfork_analysis/master/p4k_slim_imputed_genre.csv')

#Get Information about each year's BNM makeup

#Get a binary variable for BNM
full_df['BNM_Indicator'] = [1 if x == 'Best new music' else 0 for x in full_df['BNM']]

#Ensure List_Rank and year are numeric
full_df['List_Rank'] = pd.to_numeric(full_df['List_Rank'], errors='coerce').fillna(0).astype(np.int64)
full_df['year'] = pd.to_numeric(full_df['List_Year'], errors='coerce').fillna(0).astype(np.int64)

#Get an aggregated data set for counts/proportions of new albums/BNM and another for list information
#Aggregate data by year and genre, generating a count of albums and count of BNM
albums_df = pd.DataFrame(full_df.groupby(['Review_Year', 'Genre']).agg({'New_Album': 'sum', 'BNM_Indicator': 'sum'})).reset_index()
albums_df.columns = ['year', 'genre', 'genre_n_albums', 'genre_bnm']

#Aggregate by just Year to get total number of BNM
total_bnm_year_df = pd.DataFrame(full_df.groupby(['Review_Year']).agg({'BNM_Indicator': 'sum'})).reset_index()
total_bnm_year_df.columns = ['year', 'total_bnm']

#Join year aggregated data to year+genre aggregated data
albums_df = albums_df.merge(total_bnm_year_df, left_on=['year'], right_on=['year'], how='left')

#Assign variable indicating the proportion of ALL albums that are given BNM by genre and year.
#Consider this variable as the conditional probability that an album is BNM given its genre and year
#i.e., Pr(BNM=1 | Genre, Year)
albums_df['prob_bnm_given_genre'] = albums_df['genre_bnm']/albums_df['genre_n_albums']

#Assign variable indicating the proportion of BNM albums that fall within each genre by year
#Consider this variable as the conditional probability that an album is a certain genre given that it was BNM
#i.e., Pr(Genre=X | BNM=1, Year)
albums_df['prob_genre_given_bnm'] = albums_df['genre_bnm']/albums_df['total_bnm']

#Filter albums_df to 2003-2018. (BNM started in 2003)
albums_df = albums_df[(albums_df['year'] >= 2003) & (albums_df['year'] < 2019)].reset_index(drop=True)
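
The two proportions computed above are easy to confuse, so here is a small sketch using made-up counts for a single year. The names `genre_proportions` and `toy_counts` are hypothetical, and the numbers are not taken from the Pitchfork data.

```python
def genre_proportions(counts):
    """Return {genre: (Pr(BNM | genre), Pr(genre | BNM))} for one year.

    `counts` maps genre -> (albums reviewed, BNM albums). Assumes every
    genre has at least one album and the year has at least one BNM album.
    """
    total_bnm = sum(n_bnm for _, n_bnm in counts.values())
    return {
        genre: (n_bnm / n_albums, n_bnm / total_bnm)
        for genre, (n_albums, n_bnm) in counts.items()
    }

# Made-up counts for a single year, for illustration only
toy_counts = {"Rock": (100, 10), "Rap": (50, 10), "Jazz": (10, 0)}
proportions = genre_proportions(toy_counts)
for genre, (p_bnm_given_genre, p_genre_given_bnm) in proportions.items():
    print(genre, p_bnm_given_genre, p_genre_given_bnm)
```

In this toy data, Rock and Rap each account for half of the BNM albums, yet a Rap album is twice as likely as a Rock album to be labeled BNM. That gap is why the notebook tracks both measures.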

In [8]:
#Get each year's list information
list_df = full_df[(full_df['List_Rank'] > 0)]

#Remove duplicate year+rank (i.e., albums from the same artist considered as one spot)
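#.copy() makes list_df an independent frame so adding List_Points below does not trigger a SettingWithCopyWarning
list_df = list_df.drop_duplicates(subset=['year', 'List_Rank'], keep='first').copy()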

#Assign points according to albumoftheyear.org scoring system
def assignPoints(row):
    if row['List_Rank']==1:
        return 10
    elif row['List_Rank']==2:
        return 8
    elif row['List_Rank']==3:
        return 6
    elif row['List_Rank'] < 11:
        return 5
    elif row['List_Rank'] < 26:
        return 3
    elif row['List_Rank'] <= 50:
        return 1    
    else:
        return np.nan
    
list_df['List_Points'] = list_df.apply(assignPoints, axis=1)

#Aggregate data by genre and list year
list_df = pd.DataFrame(list_df.groupby(['year', 'Genre']).agg({'List_Points': 'sum'})).reset_index()
list_df.columns = ['year', 'genre', 'points']

#Filter list_df to 2003-2018 to match albums_df
list_df = list_df[(list_df['year'] >= 2003) & (list_df['year'] < 2019)].reset_index(drop=True)

In [9]:
#Join the two aggregated dataframes together
agg_df = albums_df.merge(list_df, left_on=['year', 'genre'], right_on=['year', 'genre'], how='outer')
agg_df['points'] = agg_df['points'].fillna(0)
agg_df.columns = ['Year', 'Genre', 'New Album Count', 'Best New Music Count by Genre and Year',
                  'Best New Music Count by Year', 'Proportion BNM Given Genre',
                  'Proportion Genre Given BNM', 'Sum of AOTY Points']
agg_df = agg_df.reset_index()

In [10]:
#Transform into long format as final dataset
p4k_summary_long_format = pd.melt(agg_df, id_vars=['Year', 'Genre'], value_vars=['New Album Count', 'Best New Music Count by Genre and Year',
                                                                                 'Best New Music Count by Year', 'Proportion BNM Given Genre',
                                                                                 'Proportion Genre Given BNM', 'Sum of AOTY Points'])
p4k_summary_long_format.columns = ['Year', 'Genre', 'Indicator Name', 'Value']
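
To show what the wide-to-long step does, here is a minimal sketch on a made-up two-row table. The values are invented for illustration. It passes `var_name` and `value_name` to `pd.melt` rather than renaming the columns afterwards, which should give the same column names as the cell above.

```python
import pandas as pd

# Made-up wide table: one row per year and genre, one column per measure
wide = pd.DataFrame({
    "Year": [2017, 2018],
    "Genre": ["Rap", "Rap"],
    "New Album Count": [120, 130],
    "Sum of AOTY Points": [40, 45],
})

# Each measure column becomes its own set of rows, labeled by 'Indicator Name'
long = pd.melt(wide, id_vars=["Year", "Genre"],
               var_name="Indicator Name", value_name="Value")
print(long)
```

The long table has one row per year, genre, and indicator, so a dashboard can filter on a single 'Indicator Name' column instead of switching between measure columns.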

### Step 4: Visualize results using Dash

<a href="https://skewednotes.pythonanywhere.com">Click to View Visual</a>

This Dash visual shows Pitchfork's distribution of Best New Music labels and aggregated year-end rankings (scored using Album of the Year's rubric) broken down by genre. Trends for niche genres like "Global" and "Jazz" yield little information due to small sample sizes. On the other hand, more substantive genres like Pop/R&B and Experimental show shifting sentiments over the past 18 years:

* Experimental music has declined in both review accolades and year-end list placements since Pitchfork first introduced its 'Best New Music' stamp in 2003. That year, experimental albums made up 23% of all 'Best New Music' titles, compared to 7% in 2018.

* Pitchfork went through a "Poptimist" renaissance in 2010: Pop/R&B jumped from essentially zero recognition in 2009 to one-fifth of all Best New Music albums in 2010. The genre has continued to succeed throughout the past decade, peaking in 2016 when Pop/R&B albums dominated the year-end list, with Solange, Frank Ocean, and Beyoncé taking the top three spots.

* Rap had moderate success in the early-to-mid 2000s, but its presence in positive reviews and year-end lists dwindled through the late 2000s. As with Pop/R&B, Pitchfork reviewed Rap albums much more positively during the past decade. (Kanye West's *My Beautiful Dark Twisted Fantasy* remains the decade's only new album to receive a perfect 10/10 review.) Most recently, Pitchfork placed 14 rap albums in its top 50 albums of 2018.

So has Pitchfork shifted its focus to more mainstream genres during the past decade? Perhaps. The data I scraped and modeled certainly supports a shifting perspective. I personally believe that artists have produced more genre-bending albums recently. For example, St. Vincent and Deafheaven have each landed on three of the decade's year-end lists. Pitchfork labels St. Vincent's albums as Rock and Deafheaven's as Metal. I would argue that St. Vincent's 2011 album *Strange Mercy* is as much experimental as it is rock. Annie Clark herself refers to her most recent album, *MASSEDUCTION*, as a confessional pop record. Deafheaven is just as difficult to classify. Are they a black metal band? According to many black metal fans on the black metal subreddit, no. Wikipedia suggests that the band's genre is "blackgaze" (a portmanteau of "black metal" and "shoegaze"). I would argue that their sound is more akin to post-rock than anything else.

It was fun to classify and analyze the nine genres Pitchfork uses to tag album reviews, and it's easy to communicate your music preferences to others using broad terms like "Country" or "Rap." On the other hand, these rigid genres often turn people away from music they might otherwise enjoy. I don't think that Pitchfork's views in 2010 changed as drastically as the visualization above suggests. Instead, I think our definition of genre rapidly became outdated in a world that is always sharing ideas online. Perhaps when I revisit this dataset, I'll look into unsupervised methods to create my own definition of genre—one that generates categories like "blackgaze" to better represent nuanced albums. 