# Pitchfork Music Reviews - Natural Language Processing

I plan to analyse album reviews published on https://pitchfork.com using natural language processing (NLP) techniques to explore how descriptive language varies across music genres. My aim is to identify whether reviews adhere to a consistent terminology, or if certain music styles have terms and descriptors unique to them. Based on these findings, I will train a classification model to predict the genre of unseen reviews.

This notebook covers the application of natural language processing techniques to our review dataset, looks at how language changes between genres, and covers the training and testing of a classification model that applies these findings to unseen reviews.

#### Contents:
1. Text Pre-Processing
2. Natural Language Processing
    2B. Findings
3. Text Classification
    3B. Conclusion

## 1. Text Pre-Processing

Before I can apply NLP techniques to my review data, I first need to pre-process the text. I will remove punctuation from the corpus and then define my vectorizer object, which will be used to convert the text into numerical vectors that weight each term by how important it is to a review. In the process, the text will be set to lower case, tokenised and lemmatised, and have stop words removed.
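To make the order of these steps concrete, here is a minimal sketch on a single invented sentence. It is only an illustration: it uses scikit-learn's built-in English stop word list rather than the NLTK list used in this notebook, and `TOY_LEMMAS` is a hypothetical stand-in for a real lemmatizer so the example runs without downloading any NLTK data.

```python
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical stand-in for a lemmatizer: maps a few plural forms to their lemma.
TOY_LEMMAS = {"guitars": "guitar", "synths": "synth", "drums": "drum"}

def preprocess(text):
    """Strip punctuation, lowercase, tokenise, lemmatise, then drop stop words."""
    text = re.sub(r"[^\w\s]", "", text)            # remove punctuation
    tokens = text.lower().split()                  # lowercase + whitespace tokenisation
    tokens = [TOY_LEMMAS.get(t, t) for t in tokens]  # toy lemmatisation
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]

sample = "The guitars are loud, and the synths shimmer!"
print(preprocess(sample))
```

Stop words are removed after lemmatisation here because that is the order in which scikit-learn's vectorizers apply a custom tokenizer and a stop word list.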

In [7]:
# load libraries
import pandas as pd
import re
import time
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report

In [8]:
# load cleaned data
df = pd.read_csv("review_data_clean.csv")

# remove punctuation
df['Text'] = df['Text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

In [9]:
# Define Lemmatizing and Tokenizing function
class WordLemmaTokenizer(object):
    def __init__(self):
        self.wnl=WordNetLemmatizer()
    def __call__(self,doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
    
# Adding common non-genre specific music terminology to the stop words list
stop_words= stopwords.words('english') + ['like', 'album','music','sound','song','track','record','artist','new','one']

# Define vectorizer object
vectorizer=TfidfVectorizer(analyzer='word',
                           input='content',
                           lowercase=True,
                           stop_words= stop_words,
                           min_df=3,
                           ngram_range=(1,2),
                           tokenizer=WordLemmaTokenizer())

## 2. Natural Language Processing

In this section I'll apply my vectorizer to the text data to identify the most important words in each genre. Word importance is calculated using the Term Frequency - Inverse Document Frequency (TF-IDF) method. This technique is usually applied to the entire body of text, after which the importance could be broken out by genre. However, I found that applying the TF-IDF vectorizer to each genre's text individually produced improved results with less overlap between genres. I have also done some feature engineering by adding non-genre-specific music terminology to the list of stop words. This removes the noise of words like 'album' and 'song' being deemed important when they say nothing about the genre itself.

In [4]:
# create dictionary to store important words
important_words_by_genre = {}
# define number of words per genre
top_n = 40

# fit vectorizer on each genre's text data separately
for genre in df['Genre'].unique():
    # Filter the text by genre
    genre_text = df[df['Genre'] == genre]['Text']
    # Fit and transform the vectorizer on this subset
    X_genre_tfidf = vectorizer.fit_transform(genre_text)
    feature_names = vectorizer.get_feature_names_out()
    # Create a DataFrame to hold the TF-IDF scores
    genre_tfidf_df = pd.DataFrame(X_genre_tfidf.toarray(), columns=feature_names)
    # Sum TF-IDF scores across all documents within the genre
    genre_word_scores = genre_tfidf_df.sum(axis=0).sort_values(ascending=False)[:top_n]
    # Store as a DataFrame with words and scores for this genre
    important_words_by_genre[genre] = pd.DataFrame({
        'word': genre_word_scores.index,
        'tfidf_score': genre_word_scores.values
    })



In [5]:
# save important words for each genre in their own dataframe
def top_words(genre):
    words = important_words_by_genre[genre]
    # filter out words shorter than three letters to remove lemmatizing mistakes;
    # .copy() so that columns can be added later without modifying a slice
    return words[words['word'].str.len() > 2].copy()

electronic = top_words('Electronic')
pop = top_words('Pop')
rock = top_words('Rock')
experimental = top_words('Experimental')
rap = top_words('Rap')
folk = top_words('Folk')
jazz = top_words('Jazz')
metal = top_words('Metal')

The top 40 most important words for each genre (less any dropped for being shorter than three characters) have now been saved and are visualised below.

In [10]:
# create visualisation
genres = ["Electronic", "Pop", "Rock", "Experimental", "Rap", "Folk", "Jazz", "Metal"]

fig = make_subplots(rows=len(genres), cols=1, shared_yaxes=True, subplot_titles=genres, vertical_spacing = 0.05)

for i, genre in enumerate(genres):
    genre_df = important_words_by_genre[genre]  # This assumes each genre's words & scores are in a dictionary
    genre_df = genre_df[genre_df['word'].apply(lambda x: len(x) > 2)]
    fig.add_trace(go.Scatter(x=genre_df['word'], y=genre_df['tfidf_score'], name=genre), row=i+1, col=1)

fig.update_annotations(font_size=12)
fig.update_xaxes(tickangle=45)  

fig.update_layout(height=2000, width=1000, title_text="Top Words by TF-IDF Score for Each Genre", showlegend=False)

fig.show()

In [31]:
# add genre tags
pop['genre']= 'pop'
electronic['genre']='electronic'
rock['genre']= 'rock'
experimental['genre']= 'experimental'
rap['genre']= 'rap'
folk['genre']= 'folk'
jazz['genre']= 'jazz'
metal['genre']= 'metal'

# combine into single dataframe
data_frames = [electronic, pop, rock, experimental, rap, folk, jazz, metal]
full_tfidf = pd.concat(data_frames).reset_index()
full_tfidf = full_tfidf.iloc[:,1:]

# save output
full_tfidf.to_csv('/Users/simoncrouch/Desktop/analysis_data.csv', index=False)

### 2B. Findings

My analysis of TF-IDF scores across different music genres reveals the presence of genre-specific vocabulary, with significant variations in the language used. The distribution of the TF-IDF scores provides insight into how distinctive and specialized the language is within each genre's reviews.

Rock displayed the highest TF-IDF scores for its top terms, followed by rap, with metal displaying the lowest. Because the scores are summed across every review in a genre, they grow with the number of reviews, so metal's low scores are likely explained by its particularly small dataset.

Rock, folk, electronic and jazz reviews all had a strong focus on the instruments used, with rock and folk focusing on 'guitar', electronic music on 'producer' and 'synth', and jazz on 'saxophone' and 'piano'.

Pop was found to centre on the identity of the artist, with words such as 'girl', 'singer', 'vocal' and 'star' dominating, while rap focused on the performer's lyrics and delivery, emphasising 'flow', 'verse' and 'bar'.

When I first ran this analysis, several terms such as 'album', 'music', 'song' and 'artist' were common across all genres because they are universal terms for describing music. I therefore added them to the stop words list to reduce noise and promote genre-specific terminology.

## 3. Text Classification

The distinctive vocabulary used within each genre suggests these words can serve as reliable predictors in genre classification models. In this section, I will be using this data to train and test models on their ability to correctly predict an album's genre based on its review.

### Random Forest Classifier
Here I fit a Random Forest Classifier, with the vectorizer applied to the entire corpus in a single pass.

In [37]:
# vectorize entire body of text
X = vectorizer.fit_transform(df['Text'])
feature_names = vectorizer.get_feature_names_out()
# Create a DataFrame to hold the TF-IDF scores
tfidf_df = pd.DataFrame(X.toarray(), columns=feature_names)
# add target column to tf-idf dataframe
tfidf_df['Genre'] = df['Genre'].values

In [46]:
# assign feature columns (TF-IDF scores)
X = tfidf_df.drop(columns=['Genre'])
# assign target column (Genre)
y = tfidf_df['Genre']

# encode genre labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# split data for training and testing, using the encoded labels as the target
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=12, stratify=y_encoded)

In [47]:
# fit training data to model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# make predictions
y_pred = model.predict(X_test)

In [48]:
# print evaluation metrics
# le.classes_ is in sorted order, matching the row order classification_report uses;
# tfidf_df['Genre'].unique() is in order of appearance and would mislabel the rows
print("Classification Report - Random Forest Classifier:")
print(classification_report(y_test, y_pred, 
                            target_names=le.classes_))

Classification Report - Random Forest Classifier:
              precision    recall  f1-score   support

  Electronic       0.67      0.57      0.62        28
Experimental       0.50      0.07      0.12        15
        Folk       0.00      0.00      0.00        10
        Jazz       0.00      0.00      0.00         6
       Metal       0.00      0.00      0.00         2
         Pop       0.57      0.17      0.26        24
         Rap       0.95      0.77      0.85        26
        Rock       0.46      1.00      0.63        48

    accuracy                           0.56       159
   macro avg       0.39      0.32      0.31       159
weighted avg       0.54      0.56      0.49       159




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





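Our Random Forest Classifier has an initial accuracy of 56%, but its macro-averaged F1 score is only 0.31: three of the eight genres score zero on every metric, so the accuracy is carried by the larger classes.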
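The warning printed with the report points to the `zero_division` parameter. As a small sketch on invented labels (not the notebook's data), the example below sets `zero_division=0` to state explicitly how undefined precision should be reported, and prints a confusion matrix to show which genres absorb the misclassified reviews.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels: the toy "model" never predicts Metal
y_true = ["Rock", "Rock", "Rock", "Rap", "Rap", "Metal"]
y_pred = ["Rock", "Rock", "Rock", "Rap", "Rock", "Rock"]
labels = sorted(set(y_true))  # same sorted order classification_report uses

# zero_division=0 reports undefined precision as 0 without raising a warning
print(classification_report(y_true, y_pred, zero_division=0))
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

# Rows are true genres, columns are predicted genres, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(labels)
print(cm)
```

Reading across a row of the matrix shows where a genre's reviews end up, which the single accuracy figure hides.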

### Naive Bayes Classifier
Here I fit a Naive Bayes Classifier with the vectorizer being applied to the entire corpus in a single action. I am using the ComplementNB model due to the imbalanced dataset.

In [50]:
# We are using the same data as the previous model so don't need to re-vectorize 
# or split into training and testing groups

# Train the classifier
clf = ComplementNB()
clf.fit(X_train, y_train)
    
# Make predictions
y_pred = clf.predict(X_test)
    
# Print evaluation metrics
print("Classification Report - ComplementNB:")
print(classification_report(y_test, y_pred, 
                            target_names=le.classes_))

Classification Report - ComplementNB:
              precision    recall  f1-score   support

  Electronic       0.64      0.50      0.56        28
Experimental       0.00      0.00      0.00        15
        Folk       1.00      0.10      0.18        10
        Jazz       1.00      0.17      0.29         6
       Metal       0.00      0.00      0.00         2
         Pop       0.80      0.33      0.47        24
         Rap       0.91      0.81      0.86        26
        Rock       0.46      0.98      0.63        48

    accuracy                           0.58       159
   macro avg       0.60      0.36      0.37       159
weighted avg       0.62      0.58      0.52       159




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





Our Naive Bayes Classifier has an initial accuracy of 58% and a macro-averaged F1 score of 0.37, a small improvement on the Random Forest.

### Naive Bayes Classifier - Model 2

I will now train a new ComplementNB model using the approach I took in the NLP analysis, where the text is vectorised one genre at a time with the aim of improving the algorithm's ability to identify key terminology.

In [56]:
# Create an empty list to hold each genre's TF-IDF DataFrame
tfidf_dfs = []

for genre in df['Genre'].unique():
    # Filter text by genre
    genre_text = df[df['Genre'] == genre]['Text']
    # Fit and transform the vectorizer on this genre's text data
    X_genre_tfidf = vectorizer.fit_transform(genre_text)
    all_feature_names = vectorizer.get_feature_names_out()
    # remove words shorter than three characters
    feature_names = [word for word in all_feature_names if len(word) > 2]
    # Create a DataFrame with TF-IDF scores and add the genre as a column
    genre_tfidf_df = pd.DataFrame(X_genre_tfidf.toarray(), columns=all_feature_names)
    # .copy() so the Genre column can be added without modifying a slice
    genre_tfidf_df = genre_tfidf_df[feature_names].copy()
    genre_tfidf_df['Genre'] = genre  # Add genre label as a new column
    # Append to the list
    tfidf_dfs.append(genre_tfidf_df)

# Concatenate all genre-specific TF-IDF DataFrames into one
combined_tfidf_df = pd.concat(tfidf_dfs, ignore_index=True)
# Due to different word lists between genres the dataframe contains NaN values which must be converted to 0s
combined_tfidf_df = combined_tfidf_df.fillna(0)


In [57]:
# assign feature columns (TF-IDF scores)
X = combined_tfidf_df.drop(columns=['Genre'])
# assign target column (genre)
y = combined_tfidf_df['Genre']                

# Encode genre labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=0.2, 
    random_state=12,
    stratify=y_encoded  # Maintain genre distribution in train/test sets
)
    
# Train the classifier
clf = ComplementNB()
clf.fit(X_train, y_train)
    
# Make predictions
y_pred = clf.predict(X_test)
    
# Print evaluation metrics
print("Classification Report - Naive Bayes trained by genre:")
print(classification_report(y_test, y_pred, 
                            target_names=le.classes_))

Classification Report - Naive Bayes trained by genre:
              precision    recall  f1-score   support

  Electronic       0.93      0.89      0.91        28
Experimental       1.00      0.73      0.85        15
        Folk       1.00      0.20      0.33        10
        Jazz       1.00      0.67      0.80         6
       Metal       0.00      0.00      0.00         2
         Pop       0.96      0.92      0.94        24
         Rap       1.00      0.96      0.98        26
        Rock       0.70      0.98      0.82        48

    accuracy                           0.86       159
   macro avg       0.82      0.67      0.70       159
weighted avg       0.88      0.86      0.84       159




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





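By vectorising our corpus by genre, this version of the Naive Bayes Classifier reaches an improved accuracy of 86%. This figure should be read with caution: each genre's vocabulary and IDF weights are fitted on reviews whose genre is already known, including the reviews later held out for testing, so the features themselves carry information about the label, and a review of unknown genre could not be vectorised this way.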
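For classifying a review whose genre is not yet known, one possible arrangement is to wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that the vocabulary and IDF weights are learned from the training reviews only. The sketch below is an assumption about how that could look, not the method used above: the reviews are invented and the vectorizer uses default settings rather than the custom tokenizer and stop words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Invented training reviews and genres, for illustration only
train_texts = [
    "heavy guitar riff and loud drums",
    "distorted guitar solo over drums",
    "sharp verse with a smooth flow",
    "clever bars and a confident flow",
]
train_genres = ["Rock", "Rock", "Rap", "Rap"]

# fit() learns the vocabulary and IDF weights from the training text only
pipe = make_pipeline(TfidfVectorizer(), ComplementNB())
pipe.fit(train_texts, train_genres)

# An unseen review is transformed with the already-fitted vocabulary;
# words the vectorizer never saw (here "saxophone") are simply ignored
unseen = ["a loud guitar riff with a saxophone"]
prediction = pipe.predict(unseen)
print(prediction)

vocabulary = pipe.named_steps["tfidfvectorizer"].vocabulary_
print("saxophone" in vocabulary)
```

Passing the pipeline to `train_test_split` output or to cross-validation keeps the held-out reviews out of the fitting step, which gives an accuracy estimate that applies to genuinely unseen text.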

### 3B. Conclusion

Having demonstrated that album reviews contain genre-specific language, I have applied multiple machine learning approaches to train models to classify an album's genre based on its written review.

The genre-specific TF-IDF vectorization provided the best results with a Naive Bayes classifier, achieving an 86% overall accuracy. Though TF-IDF is usually applied across the entire corpus, this result suggests that separating the text by genre prior to vectorizing enhances its ability to capture these differences, although this still needs confirming on reviews whose genre is not known when they are vectorized. Model accuracy varied significantly across genres, with high performance in "Electronic," "Pop," and "Rap," but lower success in others such as "Folk" and "Metal". This likely reflects both the limited data available for these genres and the more nuanced or ambiguous language often found in their descriptions.

Future improvements could focus on gathering additional data to balance classes, which would help the model generalize better across all genres and support testing with other Naive Bayes variants. Using scikit-learn's `GridSearchCV` for hyperparameter tuning would also allow for optimizing the model’s performance while balancing accuracy and runtime. Finally, it would be interesting to apply the model to album reviews sourced from other websites, to see how consistent the language and the genre predictions remain across different publications and review styles.
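As a hedged sketch of the hyperparameter tuning mentioned above, the example below runs `GridSearchCV` over a TF-IDF and ComplementNB pipeline. The reviews, the parameter grid and the choice of macro F1 as the scoring metric are all assumptions made for illustration; a real search would use the full review dataset and a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Invented reviews and genres, for illustration only
texts = [
    "loud guitar riff and pounding drums",
    "distorted guitar solo with heavy drums",
    "raw guitar chords and a shouted chorus",
    "driving drums under a fuzzy guitar",
    "sharp verse with a smooth flow",
    "clever bars and a confident flow",
    "dense verse packed with quick bars",
    "laid back flow over a sparse beat",
]
genres = ["Rock"] * 4 + ["Rap"] * 4

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", ComplementNB())])

# Hypothetical grid: n-gram range for the vectorizer, smoothing for the classifier
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}

# Stratified folds keep the genre balance in each split; macro F1 weights
# every genre equally, which suits an imbalanced dataset
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=12)
search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)
search.fit(texts, genres)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refitted on the training portion of every fold, so the cross-validated score is not influenced by the held-out reviews.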