# Natural Language Processing - Pitchfork Music Reviews

I will use NLP techniques, framed as both regression and classification problems, to test whether the text of a music review can predict its review score (regression) or the album's genre (classification). The first step is to build a web scraper to collect the required data.

### Pitchfork Web Scraper Build

In [46]:
import requests
import random
from bs4 import BeautifulSoup
import pandas as pd
import re
import time


# User-Agent strings to rotate through; sending a browser-like User-Agent resolves the 403 Forbidden error
userAgents = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.1',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.1',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.1',
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.3',
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.3']

# Function to clean and extract the desired text from the <a> and <em> tags located in the review_text
def extract_text(soup):
    for a_tag in soup.find_all('a'):
        if a_tag.find('em'):
            # Replace <a> with its <em> content
            a_tag.replace_with(a_tag.em.text)
        else:
            # Strip the <a> tag itself but keep its contents in place
            a_tag.unwrap()

    return soup.get_text()


# Function to extract review data from a single URL
def extract_review_data(url):
    # The timeout keeps a stalled request from hanging the scraping loop
    html_data = requests.get(url, headers={'User-Agent': random.choice(userAgents)}, timeout=10)
    # Raise on HTTP errors (e.g. 403) instead of silently parsing an error page
    html_data.raise_for_status()
    # Create the BeautifulSoup object
    soup = BeautifulSoup(html_data.content, "html.parser")
    
    # Find relevant review elements.
    # Note: these generated class names may change whenever the site is rebuilt.
    intro_text = soup.find_all("div", class_="BaseWrap-sc-gjQpdd BaseText-ewhhUZ SplitScreenContentHeaderDekDown-csTFQR iUEiRd esultD MVQMg")
    review_text = soup.find_all("div", class_="body__inner-container")
    genre = soup.find_all("p", class_="BaseWrap-sc-gjQpdd BaseText-ewhhUZ InfoSliceValue-tfmqg iUEiRd gpUuZE fkSlPp")
    # Match on the class prefix rather than the full generated name "ScoreCircle-jAxRuP akdGf"
    score_element = soup.find_all("div", class_=re.compile(r"^ScoreCircle-"))
    
    # Clean the intro and review body text
    cleaned_intro = extract_text(intro_text[0]) if intro_text else "N/A"
    cleaned_review = extract_text(review_text[0]) if review_text else "N/A"
    cleaned_genre = genre[0].get_text().strip() if genre else "N/A"
    # Guard against a score container that has no <p> child
    score_tag = score_element[0].find("p") if score_element else None
    cleaned_score = score_tag.get_text().strip() if score_tag is not None else "N/A"
    
    # Return the collected data; the space keeps the intro and body from running together
    return {
        "Text": cleaned_intro + " " + cleaned_review,
        "Genre": cleaned_genre,
        "Score": cleaned_score
    }
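
To illustrate why the score lookup matches on a class prefix, here is a minimal sketch run against hand-written markup rather than a live page. The HTML snippets, the second class suffix, and the `find_score` helper are made up for illustration; only the `ScoreCircle-` prefix and the `ScoreCircle-jAxRuP akdGf` value come from the scraper above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: same "ScoreCircle-" prefix, different generated suffixes
old_html = '<div class="ScoreCircle-jAxRuP akdGf"><p>8.3</p></div>'
new_html = '<div class="ScoreCircle-xYz123 qwErt"><p>7.9</p></div>'

score_pattern = re.compile(r"^ScoreCircle-")

def find_score(html):
    """Return the score text from the first ScoreCircle-* div, or "N/A"."""
    soup = BeautifulSoup(html, "html.parser")
    # The pattern is tested against each individual class name on the tag
    node = soup.find("div", class_=score_pattern)
    if node is None or node.find("p") is None:
        return "N/A"
    return node.find("p").get_text().strip()

# An exact class string only matches the markup it was copied from
exact = "ScoreCircle-jAxRuP akdGf"
exact_hits_new = BeautifulSoup(new_html, "html.parser").find_all("div", class_=exact)

print(find_score(old_html), find_score(new_html), len(exact_hits_new))
```

The prefix pattern finds the score in both snippets, while the exact class string finds nothing once the suffix changes. The same idea could be applied to the intro and genre selectors, which still use full generated class names.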

In [None]:
# Prepare an empty list to store the extracted data
data = []

In [48]:
# define urls
urls = [
    "https://pitchfork.com/reviews/albums/how-to-dress-well-i-am-toward-you/",
    "https://pitchfork.com/reviews/albums/lou-reed-the-blue-mask/",
    "https://pitchfork.com/reviews/albums/cloud-nothings-final-summer/",
]

In [49]:
# Loop over each URL and extract the review data
for url in urls:
    review_data = extract_review_data(url)
    data.append(review_data)
    
    # Add a delay to avoid getting blocked
    time.sleep(2)

# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(data)
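
One possible refinement to the loop is to catch per-URL failures so that a single bad request does not abort the whole run. The sketch below is not part of the notebook: `scrape_all`, its `fetch` parameter, and `fake_fetch` are hypothetical names, and `fetch` stands in for `extract_review_data` so the example runs without network access.

```python
import time

def scrape_all(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, collecting results and failures separately."""
    rows, failed = [], []
    for url in urls:
        try:
            rows.append(fetch(url))
        except Exception as exc:  # broad on purpose: record any failure and move on
            failed.append((url, str(exc)))
        time.sleep(delay)  # pause between requests to avoid getting blocked
    return rows, failed

# Stand-in for extract_review_data (hypothetical) so the example needs no network
def fake_fetch(url):
    if "missing" in url:
        raise ValueError("404 Not Found")
    return {"Text": "sample text", "Genre": "Rock", "Score": "7.5"}

rows, failed = scrape_all(
    ["https://example.com/a", "https://example.com/missing"],
    fake_fetch,
    delay=0,
)
print(len(rows), failed)
```

In the notebook this could be called as `scrape_all(urls, extract_review_data)`, with the URLs in `failed` retried later.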

In [50]:
df

| | Text | Genre | Score |
|---|---|---|---|
| 0 | Finally reissued in full, the 1996 debut from ... | Electronic | 8.3 |
| 1 | The Philly group’s second live album is a cele... | Rock | 7.9 |
| 2 | Filtered through warped hip-hop beats and garb... | Rock | 7.6 |
| 3 | On his first album in six years, Tom Krell shr... | Pop/R&B | 7.1 |
| 4 | Each Sunday, Pitchfork takes an in-depth look ... | Rock | 9.2 |
| 5 | With spruced-up production highlighting new su... | Rock | 7.5 |
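
The scraper stores `Score` as a string and falls back to "N/A" when the element is missing, so the column needs converting before it can serve as a regression target. A minimal sketch, assuming a hypothetical helper named `parse_score` and that valid scores lie on a 0 to 10 scale:

```python
def parse_score(raw):
    """Return the score as a float, or None if it is missing or malformed."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    # Assumption: valid scores lie on a 0-10 scale
    return value if 0.0 <= value <= 10.0 else None

samples = ["8.3", " 7.9 ", "N/A", None, "11.5"]
parsed = [parse_score(s) for s in samples]
print(parsed)  # [8.3, 7.9, None, None, None]
```

With pandas, the helper could be applied as `df["Score"].map(parse_score)`.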


### Data Analysis

Shown below is a high-level analysis of the collected data:

- Distribution of review scores
- Number of reviews
- Word cloud broken out by score range
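
As a sketch of the first two items, the snippet below counts the reviews and groups their scores into one-point ranges using only the standard library. The scores are the six shown in the table above; the `score_range` helper and the one-point bucket width are assumptions, not the notebook's own choices.

```python
from collections import Counter

# Scores taken from the six rows displayed above
scores = [8.3, 7.9, 7.6, 7.1, 9.2, 7.5]

def score_range(score):
    """Label a score with its one-point range, e.g. 7.6 -> "7-8"."""
    lower = min(int(score), 9)  # fold a perfect 10.0 into the top "9-10" range
    return f"{lower}-{lower + 1}"

n_reviews = len(scores)
distribution = Counter(score_range(s) for s in scores)

print(n_reviews)
print(sorted(distribution.items()))
```

The same range labels could later be used to split the review text by score range before building the word clouds.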

### NLP