# Natural Language Processing - Pitchfork Music Reviews

I will be using NLP techniques, both regression and classification, to see if music reviews can be used to determine the review score or music genre. Before starting this I need to build a web scraper to obtain the required data.

### Pitchfork Webscraper Build
Data Extraction

In [2]:
import requests
import random
from bs4 import BeautifulSoup
import pandas as pd
import re
import time
import plotly.express as px
import plotly.figure_factory as ff


# list of user agents to resolve 403 forbidden error
userAgents = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.1',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2.1 Safari/605.1.1',
              'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Safari/605.1.1',
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.3',
              'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.3']

# Function to clean and extract the desired text from the <a> and <em> tags located in the review_text
def extract_text(soup):
    for a_tag in soup.find_all('a'):
        if a_tag.find('em'):
            # Replace <a> with its <em> content
            a_tag.replace_with(a_tag.em.text)
        else:
            # Remove the entire <a> tag but keep its text content
            a_tag.unwrap()

    return soup.get_text()


# function to extract review data from each url
def extract_review_data(url):
    html_data = requests.get(url, headers={'User-Agent': random.choice(userAgents)})
    # create beautifulsoup object
    soup = BeautifulSoup(html_data.content, "html.parser")
    
    # Find relevant review elements
    intro_text = soup.find_all("div", class_="BaseWrap-sc-gjQpdd BaseText-ewhhUZ SplitScreenContentHeaderDekDown-csTFQR iUEiRd Byyns MVQMg")
    review_text = soup.find_all("div", class_="body__inner-container")
    genre = soup.find_all("p", class_="BaseWrap-sc-gjQpdd BaseText-ewhhUZ InfoSliceValue-tfmqg iUEiRd hUQWfW fkSlPp")
    #score_element = soup.find_all("div", class_="ScoreCircle-jAxRuP akdGf")
    score_element = soup.find_all("div", class_=re.compile(r"^ScoreCircle-"))
    
    # Clean the intro and review body text
    cleaned_intro = extract_text(intro_text[0]) if intro_text else "N/A"
    cleaned_review = extract_text(review_text[0]) if review_text else "N/A"
    cleaned_genre = genre[0].get_text().strip() if genre else "N/A"
    cleaned_score = score_element[0].find("p").get_text().strip() if score_element else "N/A"
    
    # Return the collected data
    return {
        "Text": cleaned_intro + ' ' + cleaned_review,
        "Genre": cleaned_genre,
        "Score": cleaned_score
    }

In [4]:
# Prepare an empty list to store the extracted data
data = []

In [5]:
# define urls
url_df = pd.read_csv("pitchfork_urls.csv", header=None)
# Extract the data from the column into a list
url_list = url_df.values.tolist()
# Flatten nested list
url_list = [item for sublist in url_list for item in sublist]

In [6]:
url_list

['https://pitchfork.com/reviews/albums/jamie-xx-in-waves/',
 'https://pitchfork.com/reviews/albums/laila-gap-year/',
 'https://pitchfork.com/reviews/albums/nidia-and-valentina-estradas/',
 'https://pitchfork.com/reviews/albums/the-war-on-drugs-live-drugs-again/',
 'https://pitchfork.com/reviews/albums/wendy-eisenberg-viewfinder/',
 'https://pitchfork.com/reviews/albums/nilufer-yanya-my-method-actor/',
 'https://pitchfork.com/reviews/albums/porches-shirt/',
 'https://pitchfork.com/reviews/albums/callahan-and-witscher-think-differently/',
 'https://pitchfork.com/reviews/albums/foxing-foxing/',
 'https://pitchfork.com/reviews/albums/basic-this-is-basic/',
 'https://pitchfork.com/reviews/albums/hayden-pedigo-live-in-amarillo-texas/',
 'https://pitchfork.com/reviews/albums/julie-my-anti-aircraft-friend/',
 'https://pitchfork.com/reviews/albums/phiik-lungs-carrot-season/',
 'https://pitchfork.com/reviews/albums/chow-lee-sex-drive/',
 'https://pitchfork.com/reviews/albums/basic-channel-bcd/',

In [6]:
# Loop over each URL and extract the review data
x=0
for url in url_list:
    x+=1 # increment
    review_data = extract_review_data(url)
    data.append(review_data)
    # Check if x is a multiple of 100
    if x % 100 == 0:
        print("Extracting from URL number:", x)
    
    # Add a delay to avoid getting blocked
    time.sleep(2)

# Convert the list of dictionaries into a DataFrame
df = pd.DataFrame(data)

Extracting from URL number: 100
Extracting from URL number: 200
Extracting from URL number: 300
Extracting from URL number: 400
Extracting from URL number: 500
Extracting from URL number: 600
Extracting from URL number: 700
Extracting from URL number: 800


In [2]:
df

NameError: name 'df' is not defined

In [16]:
# set Genre to only keep first genre, split text based on / and white space
df['Genre'] = df['Genre'].str.split(pat='[/ ]', n=1).str[0]

In [18]:
# Export data file to desktop
df.to_csv('/Users/simoncrouch/Desktop/review_data.csv', index=False)
# check that index=False has worked

### Data Analysis

Shown below is a high level analysis of the collected data.
Number of reviews and number by genre + average score of each genre
Histogram of scores

In [13]:
# Data loaded so that we do not need to re-run the above. DELETE OR COMMENT OUT LATER
df = pd.read_csv("review_data.csv")

In [14]:
print("Our data set contains the text for", len(df), "reviews.","\nDisplayed below are two plots showing the breakdown of reviews by genre and score.")

Our data set contains the text for 805 reviews. 
Displayed below are two plots showing the breakdown of reviews by genre and score.


In [15]:
# remove second index list
df = df.iloc[:, 1:]

In [16]:
df

Unnamed: 0,Text,Genre,Score
0,"Ten years after his big solo debut, the UK pro...",Electronic,7.3
1,Riding the success of singles “Like That!” and...,Pop,7.2
2,"On their debut collaboration, the beatmaker an...",Electronic,7.8
3,The Philly group’s second live album is a cele...,Rock,7.9
4,Laser eye surgery enabled the guitarist to see...,American,7.7
...,...,...,...
800,The reggae veteran’s new studio album doesn’t ...,Pop,6.7
801,In diaphanous compositions like color field pa...,Experimental,7.5
802,The Singaporean band’s new album showcases a p...,Rock,7.2
803,"Each Sunday, Pitchfork takes an in-depth look ...",Folk,9.6


In [17]:
null_genre_rows = df[df['Genre'].isna()]
print(null_genre_rows)

Empty DataFrame
Columns: [Text, Genre, Score]
Index: []


In [18]:
genre_counts = df['Genre'].value_counts()
print(genre_counts)

Rock            240
Electronic      141
Rap             131
Pop             117
Experimental     77
Folk             50
Jazz             27
Metal            12
NTS               1
Mappa             1
Time              1
Other             1
Strut             1
A24               1
Rvng              1
Rawkus            1
American          1
The               1
Name: Genre, dtype: int64


In [19]:
# view rows for genres that only appear once
genres_less_than_10 = genre_counts[genre_counts < 2]
not_genre = df[df['Genre'].isin(genres_less_than_10.index)]
not_genre

Unnamed: 0,Text,Genre,Score
4,Laser eye surgery enabled the guitarist to see...,American,7.7
60,"Each Sunday, Pitchfork takes an in-depth look ...",Rawkus,8.6
105,This invigorating compilation of African dance...,Strut,7.7
217,The Los Angeles composer—son of late Fluxus co...,Rvng,7.8
267,In keeping with the film’s surreal take on nos...,A24,7.1
361,NTS’ latest compilation is a brain-frying roll...,NTS,7.8
548,Led by Nicolás Jaar and featuring Angel Bat Da...,Other,7.5
574,A new compilation surveys the period in Japan ...,Time,7.2
604,Experimental label mappa editions’ elegiac col...,Mappa,7.3
739,Gathering vintage compositions from 27 tape re...,The,8.0


After investigating the Genres data, I found that these album do not have listed genre and so the first word of the record label has been scraped in their place. These rows will be dropped.

In [21]:
# drop rows with no genre
df = df.drop(not_genre.index)

In [33]:
genre_counts

Rock            240
Electronic      141
Rap             131
Pop             117
Experimental     77
Folk             50
Jazz             27
Metal            12
Name: Genre, dtype: int64

In [91]:
# re-run the counts
genre_counts = df['Genre'].value_counts()

fig = px.bar(genre_counts, x='Genre', template='plotly_white', title='Number of Reviews by Genre',
             labels={'Genre':'Number of reviews', 'index':'Genre'})
fig.update_traces(marker_color='rgb(135,206,235)')

fig.show()


In [37]:
print("The average review score for each genre is:", df.groupby('Genre')['Score'].mean().round(1))

The average review score for each genre is: Genre
Electronic      7.6
Experimental    7.7
Folk            7.6
Jazz            7.9
Metal           7.8
Pop             7.5
Rap             7.1
Rock            7.6
Name: Score, dtype: float64


In [101]:
# Create distplot with curve_type set to 'normal'
score_data = [df['Score'].tolist()]

fig = ff.create_distplot(score_data, ['Score'], colors=['#87ceeb'],
                         bin_size=.5, show_hist=False, show_rug=False,)

# Add title
fig.update_layout(title_text='Hist and Curve Plot')
fig.show()

In [119]:
# made using plotly express
# look into adjusting the colors etc
fig = px.histogram(df, x='Score', title='Spread of Review Scores')
fig.update_traces(marker_color='rgb(135,206,235)', xbins=dict(start=0.0, end=10.1, size=0.5))
fig.update_layout(yaxis_title_text='Count')

fig.show()


## NLP

### Text pre-processing
- lower casing
- tokenisation
- remove punctuation
- stop words
- lemmatisation

In [25]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

# convert all review text to lowercase
df['Text'] = df['Text'].str.lower()

# split text into individual tokens
df['tokens'] = df['Text'].apply(word_tokenize)
# drop Text column
#df = df.drop(['Text'], axis = 1)

# remove punctuation
df['tokens'] = df['tokens'].apply(
    lambda tokens: [w.translate(str.maketrans('', '', string.punctuation)) for w in tokens])

# remove stop words - editing ntlk's list
stop_words = set(stopwords.words('english')) - {'no', 'not'}
df['tokens'] = df['tokens'].apply(lambda tokens: [w for w in tokens if not w in stop_words])

# lemmatise text
lemmatizer = WordNetLemmatizer()
df['tokens'] = df['tokens'].apply(lambda tokens: [lemmatizer.lemmatize(w) for w in tokens])

### Text Classification
Aim to identify the most important words for each genre

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [145]:
def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)
    return text 

df['Text'] = df['Text'].apply(clean_text)

In [146]:
#stop_words = set(stopwords.words('english')) - {'no', 'not'}



# Define Lemmatizing and Tokenizing function
class WordLemmaTokenizer(object):
    def __init__(self):
        self.wnl=WordNetLemmatizer()
    def __call__(self,doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

# Define vectorizer object
vectorizer=TfidfVectorizer(analyzer='word',
                           input='content',
                           lowercase=True,
                           stop_words= set(stopwords.words('english')) - {'no', 'not'},
                           min_df=3,
                           ngram_range=(1,1),
                           tokenizer=WordLemmaTokenizer())

In [147]:
# Fit and transform the text data
X_tfidf = vectorizer.fit_transform(df['Text'])


Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ["'d", "'ll", "'re", "'s", "'ve", 'could', 'doe', 'ha', 'might', 'must', "n't", 'need', 'sha', 'wa', 'wo', 'would'] not in stop_words.



In [148]:
feature_names = vectorizer.get_feature_names_out()

In [149]:
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=feature_names)

In [150]:
tfidf_df

Unnamed: 0,00s,1,10,100,1000,10minute,10th,11,11th,12,...,zine,zines,zinger,zip,zither,zone,zonked,zoom,zulu,à
0,0.0,0.000000,0.054406,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.070506,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
790,0.0,0.055963,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
791,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
792,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
793,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [151]:
tfidf_df['Genre'] = df['Genre']
genre_word_scores = tfidf_df.groupby('Genre').sum()

In [152]:
top_n = 20  # Adjust this value to get more or fewer top words
important_words_by_genre = {}

for genre in genre_word_scores.index:
    # Get the top N words for this genre
    sorted_words = genre_word_scores.loc[genre].sort_values(ascending=False)[:top_n]
    important_words_by_genre[genre] = sorted_words.index.tolist()

# Now, important_words_by_genre contains the top N most important words for each genre
for genre, words in important_words_by_genre.items():
    print(f"Top {top_n} words for {genre}: {words}")


Top 20 words for Electronic: ['like', 'album', 'song', 'wa', 'band', 'music', 'sound', 'ha', 'new', 'record', 'one', 'feel', 'time', 'guitar', 'year', 'track', 'make', 'first', 'even', 'not']
Top 20 words for Experimental: ['album', 'like', 'band', 'song', 'new', 'ha', 'sound', 'music', 'wa', 'guitar', 'not', 'one', 'first', 'time', 'rock', 'record', 'year', 'pop', 'voice', 'track']
Top 20 words for Folk: ['like', 'album', 'song', 'pop', 'sound', 'ha', 'music', 'wa', 'vocal', 'track', 'new', 'one', 'feel', 'band', 'voice', 'beat', 'guitar', 'not', 'rb', 'record']
Top 20 words for Jazz: ['album', 'like', 'brown', 'music', 'sound', 'wa', 'song', 'beat', 'not', 'age', 'year', 'glass', 'one', 'rapper', 'ha', 'new', 'guitar', 'feel', 'vocal', 'band']
Top 20 words for Metal: ['like', 'music', 'peso', 'album', 'voice', 'singer', 'song', 'sound', 'guitar', 'new', 'artist', 'gambino', 'donald', 'spring', 'childish', 'cover', 'wa', 'snow', 'one', 'reign']
Top 20 words for Pop: ['like', 'album', 

# HERE
Visualise the above! I think Metal stands out on its own.
Visualization: Use visualizations such as word clouds or bar charts to see word frequency distributions for each genre, which might help in understanding the differences visually.

Word Embeddings: Consider using word embeddings (like Word2Vec or GloVe) to capture semantic similarities, which might help in understanding which words are more contextually unique to certain genres.

Could re-run with bi-grams included and see if that improves things.

A lot of similar words used across genres (overlapping themes, generic vocabulary / defined review language) so trying chi^2 to see if I can identify words that are statistically significant between genres.


In [None]:
from sklearn.feature_selection import chi2

Even Chi^2 supports this. In that a lot of the unique words are specific to Metal and Rap.

Why: Chi-square tests are often used in text classification to evaluate whether the presence of a word is independent of a genre (target variable). If a word is strongly associated with a genre, it will have a high chi-square value.

In [161]:
# Fit and transform the text data
X = vectorizer.fit_transform(df['Text'])
y = df['Genre']

chi2_scores, p_values = chi2(X, y)
feature_names = vectorizer.get_feature_names_out()

# Combine feature names with chi2 scores and sort by importance
chi2_df = pd.DataFrame({"Feature": feature_names, "Chi2_Score": chi2_scores})
chi2_df = chi2_df.sort_values(by="Chi2_Score", ascending=False)

print(chi2_df.head(30))  # Top 10 most important words across all genres

           Feature  Chi2_Score
6246         metal   64.169152
4239        gendel   38.130623
7839           rap   35.727178
7844        rapper   28.264811
5392          jazz   28.164568
895           band   21.496669
8520   saxophonist   18.032724
8479        sander   16.244042
10902        welch   15.472142
8750       shabaka   13.940216
1566        carter   13.922557
2730    desolation   13.223761
7224       pharoah   12.723585
9919        techno   12.531852
2300       country   11.707757
8963         sitar   11.690044
4076        fratti   11.684111
897     bandleader   11.527554
10145         tomb   11.486581
4240       gendels   10.965636
8580       screamo   10.919283
10985       wilkes   10.735663
2569    deathmetal   10.698927
9063        sludge   10.264839
6696          niño   10.227915
6383       mixtape   10.199014
78             200   10.019603
7873      rawlings    9.875880
968           beat    9.714738
3060         drill    9.406874


Conclusion: Finding are that metal and rap have their own languages while, rock, pop etc share vocab. Below to visualise chi2 data with color depending on genre.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.barplot(x='chi2_score', y='feature', hue='genre', data=top_words)
plt.title('Top 10 Important Words by Genre (Chi² Scores)')
plt.xlabel('Chi² Score')
plt.ylabel('Words')
plt.legend(title='Genre')
plt.show()

Use language to predict score. vectorize again but this time add score to dataframe instead of Genre.

2. Using Language to Predict the Score

    Approach: Predicting the score based on review text is a regression task because the score is numerical.
        Text Representation: Convert the text data into numerical features using TF-IDF, BoW, or word embeddings (e.g., Word2Vec).
        Model Selection: You can use regression models like Linear Regression, Random Forest Regressor, or even more advanced methods like XGBoost or neural networks (using PyTorch or TensorFlow).
        Feature Engineering: You can also try including sentiment scores as features, which might correlate with the review score.
       

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# TF-IDF for feature extraction
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['Review'])  # Text features
y = df['Score']  # Target (numerical score)

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("Predicted Scores:", y_pred)

Predicting Score from Review Text:

    Convert review text into numerical features using TF-IDF or BoW.
    Train regression models (e.g., Linear Regression, Random Forest).
    Optionally, include sentiment analysis as a feature to enhance predictions.

In [None]:


Exploratory analysis
- word frequency
- word cloud by genre and by score range

Feature extraction (turn words to numbers)
 - bag of words or tf-idf
 
Sentiment analysis / modelling score prediction


Word cloud broken out by score range

