# Natural Language Processing - Pitchfork Music Reviews

I will use NLP techniques, both regression and classification, to see whether the text of a music review can predict its review score or the music genre. Before starting, I need to build a web scraper to obtain the required data.

### Pitchfork Webscraper Build

In [1]:
import requests
import random
from bs4 import BeautifulSoup
import pandas as pd
import re
import time


# list of user agents to resolve 403 forbidden error
userAgents = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.1',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.1',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.1',
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.3',
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.3']

# Function to clean and extract the desired text from the <a> and <em> tags located in the review_text
def extract_text(soup):
    for a_tag in soup.find_all('a'):
        if a_tag.find('em'):
            # Replace <a> with its <em> content
            a_tag.replace_with(a_tag.em.text)
        else:
            # Remove the entire <a> tag but keep its text content
            a_tag.unwrap()

    return soup.get_text()


# Fetch one review page and extract its text, genre, and score
def extract_review_data(url):
    # Send a randomly chosen User-Agent; the timeout stops a stalled request from hanging the loop
    html_data = requests.get(url, headers={'User-Agent': random.choice(userAgents)}, timeout=30)
    # Parse the response body with BeautifulSoup
    soup = BeautifulSoup(html_data.content, "html.parser")

    # Find relevant review elements. The intro and genre selectors use the site's
    # generated class names, so they may stop matching if the page markup changes.
    intro_text = soup.find_all("div", class_="BaseWrap-sc-gjQpdd BaseText-ewhhUZ SplitScreenContentHeaderDekDown-csTFQR iUEiRd Byyns MVQMg")
    review_text = soup.find_all("div", class_="body__inner-container")
    genre = soup.find_all("p", class_="BaseWrap-sc-gjQpdd BaseText-ewhhUZ InfoSliceValue-tfmqg iUEiRd hUQWfW fkSlPp")
    # Match any class starting with "ScoreCircle-" so a changed hashed suffix still matches
    score_element = soup.find_all("div", class_=re.compile(r"^ScoreCircle-"))
    
    # Clean the intro and review body text
    cleaned_intro = extract_text(intro_text[0]) if intro_text else "N/A"
    cleaned_review = extract_text(review_text[0]) if review_text else "N/A"
    cleaned_genre = genre[0].get_text().strip() if genre else "N/A"
    cleaned_score = score_element[0].find("p").get_text().strip() if score_element else "N/A"
    
    # Return the collected data
    return {
        "Text": cleaned_intro + ' ' + cleaned_review,
        "Genre": cleaned_genre,
        "Score": cleaned_score
    }
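
The score lookup above passes a compiled regular expression as `class_` instead of a full class string. As far as I understand Beautiful Soup's behaviour, the pattern is tested against each class name on a tag, so a prefix pattern keeps working when the generated suffix changes. The minimal sketch below shows only the regex part using the standard library; the class name `ScoreCircle-xYz123` is made up for illustration.

```python
import re

# Same pattern as in extract_review_data: any class starting with "ScoreCircle-".
score_class = re.compile(r"^ScoreCircle-")

# Class names to test; "ScoreCircle-xYz123" is a hypothetical future suffix.
class_names = ["ScoreCircle-jAxRuP", "akdGf", "ScoreCircle-xYz123", "BaseText-ewhhUZ"]

# Keep only the names the pattern matches.
matches = [name for name in class_names if score_class.search(name)]
print(matches)
```

The intro and genre lookups still rely on exact class strings, so the same prefix approach might be worth considering there if those selectors stop matching.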

In [2]:
# Prepare an empty list to store the extracted data
data = []

In [3]:
# Load the review URLs from a headerless CSV
url_df = pd.read_csv("pitchfork_urls.csv", header=None)
# Convert the DataFrame rows into a nested list
url_list = url_df.values.tolist()
# Flatten the nested list into a single list of URLs
url_list = [item for sublist in url_list for item in sublist]

In [4]:
url_list

['https://pitchfork.com/reviews/albums/jamie-xx-in-waves/',
 'https://pitchfork.com/reviews/albums/laila-gap-year/',
 'https://pitchfork.com/reviews/albums/nidia-and-valentina-estradas/',
 'https://pitchfork.com/reviews/albums/the-war-on-drugs-live-drugs-again/',
 'https://pitchfork.com/reviews/albums/wendy-eisenberg-viewfinder/',
 'https://pitchfork.com/reviews/albums/nilufer-yanya-my-method-actor/',
 'https://pitchfork.com/reviews/albums/porches-shirt/',
 'https://pitchfork.com/reviews/albums/callahan-and-witscher-think-differently/',
 'https://pitchfork.com/reviews/albums/foxing-foxing/',
 'https://pitchfork.com/reviews/albums/basic-this-is-basic/',
 'https://pitchfork.com/reviews/albums/hayden-pedigo-live-in-amarillo-texas/',
 'https://pitchfork.com/reviews/albums/julie-my-anti-aircraft-friend/',
 'https://pitchfork.com/reviews/albums/phiik-lungs-carrot-season/',
 'https://pitchfork.com/reviews/albums/chow-lee-sex-drive/',
 'https://pitchfork.com/reviews/albums/basic-channel-bcd/',

In [6]:
# Loop over each URL and extract the review data
for x, url in enumerate(url_list, start=1):
    review_data = extract_review_data(url)
    data.append(review_data)
    # Report progress every 100 URLs
    if x % 100 == 0:
        print("Extracting from URL number:", x)

    # Add a delay to avoid getting blocked
    time.sleep(2)

# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(data)

Extracting from URL number: 100
Extracting from URL number: 200
Extracting from URL number: 300
Extracting from URL number: 400
Extracting from URL number: 500
Extracting from URL number: 600
Extracting from URL number: 700
Extracting from URL number: 800
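
A run over several hundred URLs takes a long time, and a single failed request or unparsable page would stop the loop partway through. The sketch below is one possible way to keep going and record failures for a later retry. The helper `scrape_all` and the stub `fake_extract` are hypothetical names introduced for illustration; in the notebook, `extract_review_data` would be passed in place of the stub, with `delay=2`.

```python
import time

def scrape_all(urls, extract, delay=0.0):
    """Run `extract` on each URL, collecting results and failures separately."""
    results, failures = [], []
    for i, url in enumerate(urls, start=1):
        try:
            results.append(extract(url))
        except Exception as exc:  # keep going if one page fails
            failures.append((url, repr(exc)))
        if i % 100 == 0:
            print("Extracting from URL number:", i)
        time.sleep(delay)  # pause between requests
    return results, failures

# Stub extractor standing in for extract_review_data (hypothetical behaviour).
def fake_extract(url):
    if "bad" in url:
        raise ValueError("could not parse page")
    return {"Text": "stub", "Genre": "Rock", "Score": "7.0"}

results, failures = scrape_all(
    ["https://example.com/ok", "https://example.com/bad"], fake_extract
)
print(len(results), len(failures))
```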


In [17]:
df

| Unnamed: 0 | Text | Genre | Score |
|---|---|---|---|
| 0 | Ten years after his big solo debut, the UK pro... | Electronic | 7.3 |
| 1 | Riding the success of singles “Like That!” and... | Pop | 7.2 |
| 2 | On their debut collaboration, the beatmaker an... | Electronic | 7.8 |
| 3 | The Philly group’s second live album is a cele... | Rock | 7.9 |
| 4 | Laser eye surgery enabled the guitarist to see... | American | 7.7 |
| ... | ... | ... | ... |
| 800 | The reggae veteran’s new studio album doesn’t ... | Pop | 6.7 |
| 801 | In diaphanous compositions like color field pa... | Experimental | 7.5 |
| 802 | The Singaporean band’s new album showcases a p... | Rock | 7.2 |
| 803 | Each Sunday, Pitchfork takes an in-depth look ... | Folk | 9.6 |


In [16]:
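# Keep only the first genre: split once on "/" or a space and take the first piece.
# The scraper's "N/A" placeholder contains a "/", so restore it after the split.
first_genre = df['Genre'].str.split(pat='[/ ]', n=1).str[0]
df['Genre'] = first_genre.where(df['Genre'] != 'N/A', 'N/A')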
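
To make the effect of the split concrete, here is a small sketch on made-up genre strings (the values are illustrative, not taken from the scraped data). It also shows one thing to watch for: the `"N/A"` placeholder that the scraper returns for a missing genre contains a `/`, so an unguarded split shortens it to `"N"`.

```python
import pandas as pd

# Hypothetical genre strings, including the scraper's "N/A" placeholder.
genres = pd.Series(["Pop/R&B", "Folk/Country", "Rock", "N/A"])

# Split once on "/" or a space and keep the first piece.
first = genres.str.split(pat="[/ ]", n=1).str[0]
print(first.tolist())

# Restore the placeholder, which the split would otherwise shorten to "N".
guarded = first.where(genres != "N/A", "N/A")
print(guarded.tolist())
```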

In [18]:
# Export the data to a CSV file on the desktop; index=False avoids writing the row index as an extra unnamed column
df.to_csv('/Users/simoncrouch/Desktop/review_data.csv', index=False)

### Data Analysis

A high-level analysis of the collected data covers:

- Number of reviews overall and by genre, plus the average score of each genre
- Histogram of scores
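
The summaries listed above could be sketched with pandas roughly as follows. The small DataFrame is made-up stand-in data in the same shape as the scraped one, and the one-point histogram bins are an assumption. Scores are scraped as strings, so they are converted to numbers first, with the `"N/A"` placeholder becoming `NaN`.

```python
import pandas as pd

# Hypothetical rows in the same shape as the scraped DataFrame.
df = pd.DataFrame({
    "Genre": ["Rock", "Rock", "Pop", "Electronic", "Pop", "Rock"],
    "Score": ["7.9", "7.2", "6.7", "7.3", "N/A", "8.1"],
})

# Coerce scores to numbers; anything unparsable such as "N/A" becomes NaN.
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

# Review count and mean score per genre (the mean skips NaN scores).
by_genre = df.groupby("Genre")["Score"].agg(reviews="size", mean_score="mean")
print(by_genre)

# Histogram counts in one-point bins from 0 to 10 (assumed bin width).
bins = list(range(0, 11))
hist = pd.cut(df["Score"], bins=bins, include_lowest=True).value_counts(sort=False)
print(hist)
```

With matplotlib installed, `df["Score"].plot.hist(bins=bins)` would draw the same histogram as a chart.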

### NLP

Word cloud broken out by score range
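
A word cloud is driven by word frequencies, so one way to prepare for it is to count words separately for each score range. The sketch below uses only the standard library. The review snippets, the score cut-offs in `score_band`, and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical (text, score) pairs standing in for the scraped reviews.
reviews = [
    ("A radiant, radiant triumph of melody", 8.6),
    ("A dull and dull retread of old ideas", 4.8),
    ("Pleasant melody but forgettable", 6.9),
]

# Assumed score bands; the cut-offs are illustrative only.
def score_band(score):
    if score >= 8.0:
        return "high"
    if score >= 6.0:
        return "mid"
    return "low"

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "and", "of", "but", "the"}

def word_counts_by_band(rows):
    """Return a dict mapping each score band to a Counter of its words."""
    counts = {}
    for text, score in rows:
        words = re.findall(r"[a-z']+", text.lower())
        kept = [w for w in words if w not in STOP_WORDS]
        counts.setdefault(score_band(score), Counter()).update(kept)
    return counts

counts = word_counts_by_band(reviews)
print(counts["high"].most_common(2))
```

Each per-band `Counter` can then be handed to a word cloud library that accepts a word-to-frequency mapping.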