# Pitchfork Music Reviews - Natural Language Processing

I plan to analyse album reviews published on https://pitchfork.com using natural language processing (NLP) techniques to explore how descriptive language varies across music genres. My aim is to identify whether reviews adhere to a consistent terminology, or if certain music styles have terms and descriptors unique to them. Based on these findings, I will train a classification model to predict the genre of unseen reviews.

This notebook applies natural language processing techniques to the review dataset, looks at how language changes between genres, and covers the training and testing of a classification model that applies these findings to unseen reviews.

## NLP

### Text pre-processing
- lower casing
- tokenisation
- remove punctuation
- stop words
- lemmatisation

In [None]:
import pandas as pd
import re
import time
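import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go  # needed for go.Scatter in the genre visualisation
from plotly.subplots import make_subplots  # needed for the genre visualisation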
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report

In [None]:
# load cleaned data
df = pd.read_csv("review_data.csv")

In [12]:
df.to_csv('/Users/simoncrouch/Desktop/review_data_clean.csv', index=False)

In [22]:
# convert all review text to lowercase
df['Text'] = df['Text'].str.lower()

# split text into individual tokens
df['tokens'] = df['Text'].apply(word_tokenize)
# drop Text column
#df = df.drop(['Text'], axis = 1)

# remove punctuation
df['tokens'] = df['tokens'].apply(
    lambda tokens: [w.translate(str.maketrans('', '', string.punctuation)) for w in tokens])

# remove stop words, keeping 'no' and 'not' from nltk's list
stop_words = set(stopwords.words('english')) - {'no', 'not'}
df['tokens'] = df['tokens'].apply(lambda tokens: [w for w in tokens if w not in stop_words])

# lemmatise text
lemmatizer = WordNetLemmatizer()
df['tokens'] = df['tokens'].apply(lambda tokens: [lemmatizer.lemmatize(w) for w in tokens])
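
The `df` output later in the notebook shows empty strings and curly quotes (`’`, `“`) surviving in `tokens`: `string.punctuation` covers ASCII characters only, and stripping a token that is pure punctuation leaves `''` behind. A minimal sketch of one way to handle both, using only the standard library. `strip_punctuation` is a hypothetical helper, not part of the original pipeline:

```python
import unicodedata


def strip_punctuation(tokens):
    """Remove Unicode punctuation and symbol characters from each token,
    then drop any token that ends up empty."""
    cleaned = []
    for token in tokens:
        stripped = "".join(
            ch for ch in token
            # Unicode categories starting with P (punctuation) or S (symbol)
            if not unicodedata.category(ch).startswith(("P", "S"))
        )
        if stripped:
            cleaned.append(stripped)
    return cleaned


example = ["philly", "group", "’", "second", ",", "“", "like", "”", "singer-songwriter"]
print(strip_punctuation(example))
```

In the notebook this could stand in for the `str.translate` step, for example `df['tokens'] = df['tokens'].apply(strip_punctuation)`.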

### Text Classification
Aim to identify the most important words for each genre

In [24]:
def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text)
    return text 

df['Text'] = df['Text'].apply(clean_text)

In [113]:
# Define Lemmatizing and Tokenizing function
class WordLemmaTokenizer(object):
    def __init__(self):
        self.wnl=WordNetLemmatizer()
    def __call__(self,doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
    
# removing common non-genre specific music terminology
stop_words= stopwords.words('english') + ['like', 'album','music','sound','song','track','record','artist','new','one']

# Define vectorizer object
vectorizer=TfidfVectorizer(analyzer='word',
                           input='content',
                           lowercase=True,
                           #stop_words= set(stopwords.words('english'))
                           # removing common non-genre specific music terminology
                           stop_words= stop_words,
                           min_df=3,
                           ngram_range=(1,2),
                           tokenizer=WordLemmaTokenizer())
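
For reference, the vectorizer above leaves `smooth_idf`, `sublinear_tf` and `norm` at their scikit-learn defaults. Under that assumption, the weight given to term $t$ in review $d$ is:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Here $n$ is the number of reviews the vectorizer was fit on, $\mathrm{df}(t)$ is the number of those reviews containing $t$, and $\mathrm{tf}(t, d)$ is the raw count of $t$ in $d$. Each review's vector is then scaled to unit Euclidean length, so every individual weight lies between 0 and 1.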

In [114]:
# create dictionary to store important words for each genre
important_words_by_genre = {}
# define number of words per genre
top_n = 40

# fit vectorizer on each genre's text data separately
for genre in df['Genre'].unique():
    # Filter the text by genre
    genre_text = df[df['Genre'] == genre]['Text']
    # Fit and transform the vectorizer on this subset
    X_genre_tfidf = vectorizer.fit_transform(genre_text)
    feature_names = vectorizer.get_feature_names_out()
    # Create a DataFrame to hold the TF-IDF scores
    genre_tfidf_df = pd.DataFrame(X_genre_tfidf.toarray(), columns=feature_names)
    # Sum TF-IDF scores across all documents within the genre
    genre_word_scores = genre_tfidf_df.sum(axis=0).sort_values(ascending=False)[:top_n]
    # Store as a DataFrame with words and scores for this genre
    important_words_by_genre[genre] = pd.DataFrame({
        'word': genre_word_scores.index,
        'tfidf_score': genre_word_scores.values
    })
    
# important_words_by_genre['Genre Name'] to access values


Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ["'d", "'ll", "'re", "'s", "'ve", 'could', 'doe', 'ha', 'might', 'must', "n't", 'need', 'sha', 'wa', 'wo', 'would'] not in stop_words.
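
The warning above is raised because scikit-learn runs the stop words through the same tokenizer as the documents, and lemmatising turns `was` into `wa`, `has` into `ha` and so on, none of which are in `stop_words`. One way to address it, sketched here on the assumption that the tokenizer is any callable returning a list of tokens, is to pass the stop words through the tokenizer first. `normalise_stop_words` and `toy_tokenizer` are hypothetical names; the toy tokenizer only mimics the lemmatiser's behaviour so that the example runs without NLTK data:

```python
def normalise_stop_words(stop_words, tokenizer):
    """Tokenise every stop word with the document tokenizer and return the
    original words plus any new tokens that this produces."""
    normalised = set(stop_words)
    for word in stop_words:
        normalised.update(tokenizer(word))
    return sorted(normalised)


def toy_tokenizer(doc):
    """Hypothetical stand-in for WordLemmaTokenizer: strips a trailing 's'."""
    return [w[:-1] if len(w) > 2 and w.endswith("s") else w for w in doc.split()]


print(normalise_stop_words(["was", "has", "the"], toy_tokenizer))
```

In the notebook the equivalent call would be `stop_words = normalise_stop_words(stop_words, WordLemmaTokenizer())` before the `TfidfVectorizer` is built.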



In [138]:
genre_tfidf_df

Unnamed: 0,90,across,aesthetic,affiliate,along,already,also,always,another,around,...,wave,way,whose,within,word,work,world,would,year,yet
0,0.105146,0.0,0.117144,0.0,0.0,0.0,0.0,0.0,0.0,0.117144,...,0.117144,0.073541,0.0,0.105146,0.0,0.0,0.0,0.0,0.079874,0.0
1,0.0,0.114873,0.0,0.0,0.0,0.114873,0.0,0.0,0.0,0.0,...,0.0,0.072116,0.103108,0.0,0.0,0.0,0.114873,0.0,0.0,0.0
2,0.086678,0.0,0.0,0.0,0.0,0.0,0.071764,0.193138,0.086678,0.193138,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065845,0.143528
3,0.0,0.0,0.118148,0.0,0.0,0.0,0.0878,0.0,0.0,0.0,...,0.0,0.0,0.0,0.106047,0.0,0.0,0.0,0.0,0.080559,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.090345,0.0,0.109121,0.0,...,0.0,0.076321,0.0,0.0,0.0,0.0,0.0,0.0,0.082894,0.090345
5,0.0,0.0,0.0,0.09104,0.09104,0.182079,0.0,0.0,0.32686,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074096,0.0,0.13531
6,0.0,0.059748,0.0,0.059748,0.0,0.059748,0.0,0.059748,0.214514,0.179244,...,0.059748,0.075018,0.107257,0.053629,0.059748,0.059748,0.119496,0.048629,0.0,0.0
7,0.0,0.0,0.110642,0.0,0.0,0.0,0.082222,0.0,0.0,0.0,...,0.0,0.138918,0.0,0.0,0.110642,0.0,0.221284,0.0,0.0,0.0
8,0.062842,0.0,0.0,0.070013,0.0,0.0,0.156088,0.0,0.0,0.0,...,0.0,0.0,0.062842,0.125684,0.0,0.070013,0.0,0.056983,0.095476,0.052029
9,0.0,0.0,0.0,0.0,0.085618,0.0,0.0,0.085618,0.0,0.0,...,0.0,0.053749,0.0,0.0,0.0,0.171235,0.0,0.069684,0.0,0.063626


In [115]:
# filter out words less than three letters to remove lemmatizing mistakes
electronic = important_words_by_genre['Electronic']
electronic = electronic[electronic['word'].apply(lambda x: len(x) > 2)]

pop = important_words_by_genre['Pop']
pop = pop[pop['word'].apply(lambda x: len(x) > 2)]

rock = important_words_by_genre['Rock']
rock = rock[rock['word'].apply(lambda x: len(x) > 2)]

experimental = important_words_by_genre['Experimental']
experimental = experimental[experimental['word'].apply(lambda x: len(x) > 2)]

rap = important_words_by_genre['Rap']
rap = rap[rap['word'].apply(lambda x: len(x) > 2)]

folk = important_words_by_genre['Folk']
folk = folk[folk['word'].apply(lambda x: len(x) > 2)]

jazz = important_words_by_genre['Jazz']
jazz = jazz[jazz['word'].apply(lambda x: len(x) > 2)]

metal = important_words_by_genre['Metal']
metal = metal[metal['word'].apply(lambda x: len(x) > 2)]

In [116]:
# create visualisation
genres = ["Electronic", "Pop", "Rock", "Experimental", "Rap", "Folk", "Jazz", "Metal"]

fig = make_subplots(rows=len(genres), cols=1, shared_yaxes=True, subplot_titles=genres)

for i, genre in enumerate(genres):
    genre_df = important_words_by_genre[genre]  # This assumes each genre's words & scores are in a dictionary
    genre_df = genre_df[genre_df['word'].apply(lambda x: len(x) > 2)]
    fig.add_trace(go.Scatter(x=genre_df['word'], y=genre_df['tfidf_score'], name=genre), row=i+1, col=1)

fig.update_annotations(font_size=12)

fig.update_layout(height=2000, width=1000, title_text="Top Words by TF-IDF Score for Each Genre", showlegend=False)

fig.show()

In [128]:
# add genre tags and save output
pop['genre']= 'pop'
electronic['genre']='electronic'
rock['genre']= 'rock'
experimental['genre']= 'experimental'
rap['genre']= 'rap'
folk['genre']= 'folk'
jazz['genre']= 'jazz'
metal['genre']= 'metal'

data_frames = [electronic, pop, rock, experimental, rap, folk, jazz, metal]
 
full_tfidf = pd.concat(data_frames).reset_index()
full_tfidf = full_tfidf.iloc[:,1:]

full_tfidf.to_csv('/Users/simoncrouch/Desktop/analysis_data.csv', index=False)

### Write up Analysis

Score Differences

Genres have different words. Which overlap, and which are unique to a genre?

The range of TF-IDF scores differs between genres such as Metal and Rock. Each score here is the sum of a term's TF-IDF weights over every review in the genre, with the vectorizer fit on that genre alone, so a score reflects both how consistently a term is used within the genre and how many reviews the genre has.

- Higher maximum TF-IDF in Rock: if Rock's top word scores around 12 while Metal's tops out at 2.5, this implies that a few words recur across many Rock reviews. Rock is also the largest genre in the dataset and Metal the smallest, so more reviews contribute to each Rock sum.
- Score range and term hierarchy: the broader score range in Rock suggests a clearer hierarchy of word importance, with certain terms standing out as characteristic of Rock reviews. The narrower range in Metal indicates that no single word dominates, which could mean Metal reviews use more varied language, or simply that there are too few of them to tell.
- Inter-genre comparison: the IDF weights are computed within each genre's reviews, not against the other genres, so a high score does not by itself mean a word is exclusive to that genre. Comparing genres fairly requires dividing by the number of reviews (a mean rather than a sum) or fitting one vectorizer on all reviews.

With those caveats, the top terms still show how closely a genre's reviews gravitate around a shared vocabulary, which helps to distinguish genre-specific language patterns.
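
Because the per-genre scores are computed with `genre_tfidf_df.sum(axis=0)`, they grow with the number of reviews in a genre. A small sketch of that effect, using two made-up "genres" that contain the same reviews, one of them repeated five times. `guitar_scores` is a hypothetical helper and the toy reviews are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def guitar_scores(docs):
    """Return (summed, mean) TF-IDF weight of the term 'guitar' over docs."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    col = vec.vocabulary_["guitar"]
    weights = X[:, col].toarray().ravel()
    return weights.sum(), weights.mean()


small_genre = ["loud guitar riff", "guitar solo"]
large_genre = small_genre * 5  # same language, five times as many reviews

small_sum, small_mean = guitar_scores(small_genre)
large_sum, large_mean = guitar_scores(large_genre)

print(f"sum:  small={small_sum:.2f}  large={large_sum:.2f}")
print(f"mean: small={small_mean:.2f}  large={large_mean:.2f}")
```

Swapping `.sum(axis=0)` for `.mean(axis=0)` in the per-genre loop would put genres of different sizes on a more comparable scale.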

This section shows that different language is used for different genres.
The next section asks whether that can be used to determine which genre a review is for, based on its text.
A possible extension, time permitting, is to predict the review score from the language used.

High-Scoring Terms by Genre:

- Electronic: Words like "producer," "synth," "ambient," and "techno" have high TF-IDF scores, highlighting the genre's focus on production elements and electronic subgenres.
- Rock: "Band," "guitar," and "rock" are central, indicating the importance of traditional band setups and instruments.
- Rap: "Rap," "beat," and "rapper" emphasize rhythmic and vocal elements typical in this genre.
- Jazz: "Saxophone," "solo," and "piano" appear, reflecting jazz's reliance on individual instruments and improvisation.
- Metal: Terms like "riff," "death," and "darkness" reveal themes of intensity and dark tonality.

Common Words Across Genres:

- Words like "album," "song," and "music" are core to discussing any form of music and appear in reviews of every genre, which is why they were added to the stop word list before vectorizing.

Genre-Specific Language Patterns:

- This data suggests that while certain themes (like emotion, time, and creation) are universal, genres diverge in their lexicon. For instance, rock and metal often share references to instruments, but metal includes darker thematic words.

Potential Applications:

- Genre Prediction: TF-IDF scores can serve as features in a classification model to predict the genre of a review from the words it uses.
    
    
Genre-Defining Terms:

- Rock has the strongest genre-specific vocabulary by raw score, with "band" having the highest TF-IDF score (11.15) of any genre
- Rap shows strong association with its core elements ("rap", "beat", "rapper")
- Metal focuses on specific subgenres ("death metal") and technical elements ("riff")

Instrumental Focus:

- "Guitar" appears prominently in rock (6.72) and folk (2.28), and recurs across several other genres
- Electronic music emphasizes production elements ("producer", "drum", "synth")
- Jazz highlights specific instruments ("piano", "saxophone", "bass")

Vocal/Lyrical Elements:

- Pop emphasizes vocals and emotional content ("voice", "love", "she's")
- Rap focuses on delivery ("flow", "verse", "bar")
- Metal pays attention to lyrical themes ("darkness")

Common Ground:

- "Time" appears as an important term across multiple genres
- "Feel" is significant in electronic, rap, and rock
- "Voice" appears prominently in pop and crosses over to other genres

Production Elements:

- Electronic music has strong associations with production terms ("producer", "ambient", "synth")
- Rap emphasizes beats and production ("beat", "producer", "sample")
- Folk and rock tend to use more traditional instrumental terminology
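
The question of which words overlap and which are unique to a genre can also be answered programmatically. A minimal sketch using only the standard library; the `top_words` dictionary is a hypothetical, shortened stand-in for the word lists held in `important_words_by_genre`:

```python
from collections import Counter

# Hypothetical top-word lists; in the notebook these would come from
# important_words_by_genre[genre]['word'] for each genre.
top_words = {
    "electronic": ["feel", "producer", "synth", "time"],
    "rap": ["feel", "beat", "producer", "time"],
    "rock": ["feel", "band", "guitar", "time"],
}

# Count how many genres each word appears in.
genre_counts = Counter(word for words in top_words.values() for word in set(words))

# Words that appear in more than one genre's list.
shared = sorted(w for w, n in genre_counts.items() if n > 1)

# Words that appear in exactly one genre's list, grouped by that genre.
unique = {
    genre: sorted(w for w in words if genre_counts[w] == 1)
    for genre, words in top_words.items()
}

print("shared:", shared)
print("unique:", unique)
```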

## Can a review be used to predict genre?


In [125]:
full_tfidf

Unnamed: 0,index,word,tfidf_score,genre
0,2,feel,3.818590,electronic
1,3,producer,3.702383,electronic
2,4,vocal,3.495372,electronic
3,5,drum,3.423403,electronic
4,6,club,3.383266,electronic
...,...,...,...,...
292,35,find,0.523511,metal
293,36,live,0.519520,metal
294,37,return,0.516444,metal
295,38,piece,0.512191,metal


In [135]:
X_train

Unnamed: 0,tfidf_score
249,0.824964
216,1.097798
31,2.493152
228,1.057363
47,2.562894
...,...
259,2.191895
130,1.654204
241,0.893185
253,0.777460


In [141]:
df

Unnamed: 0,Text,Genre,Score,tokens
0,ten years after his big solo debut the uk prod...,Electronic,7.3,"[ten, year, big, solo, debut, , uk, producer, ..."
1,riding the success of singles like that and no...,Pop,7.2,"[riding, success, single, “, like, , ”, “, not..."
2,on their debut collaboration the beatmaker and...,Electronic,7.8,"[debut, collaboration, , beatmaker, drummer, s..."
3,the philly groups second live album is a celeb...,Rock,7.9,"[philly, group, ’, second, live, album, celebr..."
5,on her third album the uk singersongwriter sou...,Pop,8.5,"[third, album, , uk, singersongwriter, sound, ..."
...,...,...,...,...
800,the reggae veterans new studio album doesnt ma...,Pop,6.7,"[reggae, veteran, ’, new, studio, album, ’, ma..."
801,in diaphanous compositions like color field pa...,Experimental,7.5,"[diaphanous, composition, like, color, field, ..."
802,the singaporean bands new album showcases a pu...,Rock,7.2,"[singaporean, band, ’, new, album, showcase, p..."
803,each sunday pitchfork takes an indepth look at...,Folk,9.6,"[sunday, , pitchfork, take, indepth, look, sig..."


### Fit Random Forest Classifier

In [142]:
# vectorize entire body of text
X_tfidf = vectorizer.fit_transform(df['Text'])
feature_names = vectorizer.get_feature_names_out()
# Create a DataFrame to hold the TF-IDF scores
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=feature_names)


In [147]:
# add target column to tf-idf dataframe
tfidf_df['Genre'] = df['Genre'].values

In [148]:
X = tfidf_df.drop(columns=['Genre'])  # Feature columns (TF-IDF scores)
y = tfidf_df['Genre']                 # Target column (genre)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [149]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

RandomForestClassifier()

TODO: use `classification_report` here instead, so that `accuracy_score` no longer needs importing at the start. Same for the next model.

In [150]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6289308176100629
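
A single accuracy figure hides how each genre fares. A minimal sketch of the per-class report on toy labels (the labels are assumptions, not notebook data). Passing `zero_division=0` sets precision to 0 for a genre the model never predicts, without raising a warning:

```python
from sklearn.metrics import classification_report

# Toy labels: the model never predicts "Metal", which is the situation that
# makes precision ill-defined for that class.
y_true = ["Rock", "Rock", "Metal", "Pop"]
y_hat = ["Rock", "Rock", "Rock", "Pop"]

report = classification_report(y_true, y_hat, zero_division=0, output_dict=True)

for label in ["Metal", "Pop", "Rock"]:
    row = report[label]
    print(f"{label:<6} precision={row['precision']:.2f} recall={row['recall']:.2f}")
print(f"accuracy={report['accuracy']:.2f}")
```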


Training on each genre separately

In [180]:
# Create an empty list to hold each genre's TF-IDF DataFrame
tfidf_dfs = []

for genre in df['Genre'].unique():
    # Filter text by genre
    genre_text = df[df['Genre'] == genre]['Text']
    # Fit and transform the vectorizer on this genre's text data
    X_genre_tfidf = vectorizer.fit_transform(genre_text)
    feature_names = vectorizer.get_feature_names_out()
    # Create a DataFrame with TF-IDF scores and add the genre as a column
    genre_tfidf_df = pd.DataFrame(X_genre_tfidf.toarray(), columns=feature_names)
    genre_tfidf_df['Genre'] = genre  # Add genre label as a new column
    # Append to the list
    tfidf_dfs.append(genre_tfidf_df)


In [181]:
# Concatenate all genre-specific TF-IDF DataFrames into one
combined_tfidf_df = pd.concat(tfidf_dfs, ignore_index=True)
# Due to different word lists between genres the dataframe contains NaN values which must be converted to 0s
combined_tfidf_df = combined_tfidf_df.fillna(0)

In [174]:
X = combined_tfidf_df.drop(columns=['Genre'])  # Feature columns (TF-IDF scores)
y = combined_tfidf_df['Genre']                 # Target column (genre)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.6540880503144654


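Accuracy has improved from 0.63 to 0.65, but the gain is only four of the 159 test reviews and neither forest is seeded, so it may be noise. The per-genre features also leak the label: the vectorizer is fit on each genre before the train/test split, so every test review has already been transformed with its true genre's vocabulary and IDF weights. Next, hyperparameter tuning.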
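
A minimal sketch of fitting the vectorizer on the training reviews only, which is what wrapping it in a `Pipeline` achieves during cross-validation. The toy reviews and the `_toy` variable names are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, assumed to be already split into train and test.
train_texts = ["heavy guitar riff", "warm synth pad", "guitar feedback", "modular synth drone"]
test_texts = ["distorted guitar solo"]

vec = TfidfVectorizer()
X_train_toy = vec.fit_transform(train_texts)  # vocabulary and IDF come from training reviews only
X_test_toy = vec.transform(test_texts)        # test reviews reuse them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(X_test_toy.toarray())
```

Words that occur only in the test review ("distorted", "solo") never enter the vocabulary, so nothing about the test set shapes the features.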

### Fit Naive Bayes

Naive Bayes using TF-IDF fit on the whole text

In [186]:
# vectorize entire body of text
X_tfidf = vectorizer.fit_transform(df['Text'])
feature_names = vectorizer.get_feature_names_out()
# Create a DataFrame to hold the TF-IDF scores
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=feature_names)


X = tfidf_df  # All text data
y = df['Genre']  # All genre labels

# Encode genre labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=0.2, 
    random_state=12,
    stratify=y_encoded  # Maintain genre distribution in train/test sets
)
    
# Train the classifier
clf = ComplementNB()
clf.fit(X_train, y_train)
    
# Make predictions
y_pred = clf.predict(X_test)
    
# Print evaluation metrics
print("Classification Report:")
print(classification_report(y_test, y_pred, 
                            target_names=le.classes_))

Classification Report:
              precision    recall  f1-score   support

  Electronic       0.64      0.50      0.56        28
Experimental       0.00      0.00      0.00        15
        Folk       1.00      0.10      0.18        10
        Jazz       1.00      0.17      0.29         6
       Metal       0.00      0.00      0.00         2
         Pop       0.80      0.33      0.47        24
         Rap       0.91      0.81      0.86        26
        Rock       0.46      0.98      0.63        48

    accuracy                           0.58       159
   macro avg       0.60      0.36      0.37       159
weighted avg       0.62      0.58      0.52       159




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
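
In the report above Rock has recall 0.98 but precision 0.46, which suggests that many reviews from other genres are being labelled Rock. A confusion matrix makes that pattern visible. A minimal sketch on toy labels (assumptions, not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix

labels = ["Folk", "Pop", "Rock"]
y_true = ["Folk", "Folk", "Pop", "Pop", "Rock", "Rock"]
y_hat = ["Rock", "Folk", "Rock", "Pop", "Rock", "Rock"]

# Rows are true genres, columns are predicted genres, both in `labels` order.
cm = confusion_matrix(y_true, y_hat, labels=labels)
for label, row in zip(labels, cm):
    print(f"{label:<5}", row.tolist())
```

A heavy final column, as in this toy case, is the signature of a model that defaults to the largest class.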





Naive Bayes when TF-IDF has been fit on each genre separately

In [187]:
# Create an empty list to hold each genre's TF-IDF DataFrame
tfidf_dfs = []

for genre in df['Genre'].unique():
    # Filter text by genre
    genre_text = df[df['Genre'] == genre]['Text']
    # Fit and transform the vectorizer on this genre's text data
    X_genre_tfidf = vectorizer.fit_transform(genre_text)
    feature_names = vectorizer.get_feature_names_out()
    # Create a DataFrame with TF-IDF scores and add the genre as a column
    genre_tfidf_df = pd.DataFrame(X_genre_tfidf.toarray(), columns=feature_names)
    genre_tfidf_df['Genre'] = genre  # Add genre label as a new column
    # Append to the list
    tfidf_dfs.append(genre_tfidf_df)

# Concatenate all genre-specific TF-IDF DataFrames into one
combined_tfidf_df = pd.concat(tfidf_dfs, ignore_index=True)
# Due to different word lists between genres the dataframe contains NaN values which must be converted to 0s
combined_tfidf_df = combined_tfidf_df.fillna(0)


In [188]:
X = combined_tfidf_df.drop(columns=['Genre'])  # Feature columns (TF-IDF scores)
y = combined_tfidf_df['Genre']                 # Target column (genre)

# Encode genre labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=0.2, 
    random_state=12,
    stratify=y_encoded  # Maintain genre distribution in train/test sets
)
    
# Train the classifier
clf = ComplementNB()
clf.fit(X_train, y_train)
    
# Make predictions
y_pred = clf.predict(X_test)
    
# Print evaluation metrics
print("Classification Report:")
print(classification_report(y_test, y_pred, 
                            target_names=le.classes_))

Classification Report:
              precision    recall  f1-score   support

  Electronic       0.92      0.86      0.89        28
Experimental       1.00      0.80      0.89        15
        Folk       1.00      0.20      0.33        10
        Jazz       1.00      0.67      0.80         6
       Metal       0.00      0.00      0.00         2
         Pop       0.96      0.92      0.94        24
         Rap       1.00      0.96      0.98        26
        Rock       0.70      0.98      0.82        48

    accuracy                           0.86       159
   macro avg       0.82      0.67      0.71       159
weighted avg       0.88      0.86      0.84       159




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





### Analysis

Why ComplementNB might outperform standard (multinomial) Naive Bayes and Random Forest on this genre classification task.

Class Imbalance Impact:

The data shows significant imbalance. The test set of 159 reviews contains:

- Rock: 48 samples
- Electronic: 28 samples
- Rap: 26 samples
- Pop: 24 samples
- Experimental: 15 samples
- Folk: 10 samples
- Jazz: 6 samples
- Metal: 2 samples
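
A useful reference point for the accuracies in this notebook is the majority-class baseline, which is the score obtained by always predicting the largest genre. A minimal sketch using the support counts listed above:

```python
from collections import Counter

# Test-set support per genre, copied from the classification report.
support = Counter({
    "Rock": 48, "Electronic": 28, "Rap": 26, "Pop": 24,
    "Experimental": 15, "Folk": 10, "Jazz": 6, "Metal": 2,
})

total = sum(support.values())
majority_genre, majority_count = support.most_common(1)[0]
baseline_accuracy = majority_count / total

print(f"{majority_genre}: {majority_count}/{total} = {baseline_accuracy:.3f}")
```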


ComplementNB is specifically designed for imbalanced datasets:

- It estimates each class's parameters using samples from all classes except the one being modelled
- This helps with the small classes (Metal, Jazz, Folk) by using more data for parameter estimation
- Standard multinomial Naive Bayes would struggle with small classes as it has less data to estimate their parameters
- Even so, Metal (2 test samples) scored 0 on every metric and Folk (10 test samples) had low recall despite perfect precision, so the smallest classes remain a problem
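
The "all classes except the one being modelled" idea can be written out. The following is the complement estimate as given in the scikit-learn documentation for `ComplementNB`, reproduced here as a reference sketch, with $d_{ij}$ the TF-IDF weight (or count) of term $i$ in review $j$, $\alpha_i$ the smoothing parameter and $\alpha = \sum_i \alpha_i$:

$$
\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:\, y_j \neq c} d_{ij}}{\alpha + \sum_{j:\, y_j \neq c} \sum_k d_{kj}},
\qquad
w_{ci} = \log \hat{\theta}_{ci},
\qquad
\hat{c} = \arg\min_c \sum_i t_i \, w_{ci}
$$

Here $t_i$ is the weight of term $i$ in the review being classified. The predicted genre is the one whose complement matches the review worst. With `norm=True` the weights are additionally rescaled as $w_{ci} / \sum_j |w_{cj}|$.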




Text Classification Characteristics:

ComplementNB advantages:

- Better handles the "long tail" of rare words in music reviews
- More robust to vocabulary differences between genres
- Less sensitive to dominant classes overwhelming minority classes

Standard Naive Bayes limitations:

- More sensitive to class imbalance
- Can be overwhelmed by dominant classes (Rock in this dataset)
- May overfit to specific words in small classes




Model Performance Analysis:

Strengths shown in the results (per-genre TF-IDF features):

- High precision across most genres (many at 1.00)
- Strong performance on medium-sized classes (Rap: 0.98 F1, Pop: 0.94 F1)
- Good balance for Electronic (0.89 F1)

Challenges shown:

- Metal: complete failure (0.00 across the board) due to tiny sample size
- Folk: low recall (0.20) despite perfect precision
- Rock: lower precision (0.70) but high recall (0.98)




Why Random Forest performed in the middle:

Random Forest characteristics:

- Good at handling non-linear relationships
- Can capture complex patterns in text
- But may struggle with:
    - High-dimensional sparse data (typical in text)
    - Very imbalanced classes
    - Limited training data for some classes

In [189]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8553459119496856


### Hyperparameter tuning

Pipeline integration: combining the vectorizer and classifier into a single pipeline means the vectorizer is refit on the training folds only in each round of cross-validation.

Key parameters to tune:

For TfidfVectorizer:

- min_df: try different minimum document frequencies
- ngram_range: test different n-gram combinations
- max_features: limit vocabulary size
- norm: try different normalization schemes

For ComplementNB:

- alpha: smoothing parameter
- norm: whether to normalize weight vectors

Cross-validation with GridSearchCV makes parameter selection more robust, and the F1-weighted scoring metric accounts for class imbalance.

The param_grid below can be modified as needed.

If runtime is a concern:

- Reduce the parameter grid size
- Use RandomizedSearchCV instead of GridSearchCV
- Reduce the number of cross-validation folds

If memory is a concern:

- Start with a smaller max_features range
- Limit the ngram_range to (1,2)
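
The grid defined in the next cell has 3 × 3 × 3 × 2 × 4 × 2 = 432 combinations, which is 2,160 fits with five folds. A minimal sketch of the `RandomizedSearchCV` alternative, which samples a fixed number of combinations. The toy corpus, the smaller parameter lists and the name `toy_pipeline` are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): six short "reviews" per genre.
rock = ["loud guitar band", "guitar riff band", "band plays guitar",
        "guitar solo band", "band guitar drums", "raw guitar band"]
rap = ["hard beat rapper", "rapper flow beat", "beat and verse rapper",
       "rapper beat sample", "beat rapper bars", "smooth beat rapper"]
texts = rock + rap
genres = ["Rock"] * len(rock) + ["Rap"] * len(rap)

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", ComplementNB())])

param_distributions = {
    "tfidf__min_df": [1, 2],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 0.5, 1.0, 2.0],
}

search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions,
    n_iter=5,             # sample 5 of the 16 combinations instead of trying all
    cv=3,
    scoring="f1_weighted",
    random_state=12,      # makes the sampled combinations reproducible
)
search.fit(texts, genres)

print(search.best_params_)
```

On the real data, `n_iter` sets the runtime budget directly: the number of fits is `n_iter` times the number of folds, plus one final refit.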

In [191]:
# ORIGINAL PIPELINE WITHOUT TFIDF ON INDIVIDUAL GENRE TEXT

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

def analyze_genre_specific_features(df, vectorizer, n_top_features=10):
    """Analyze top features for each genre after fitting the vectorizer once"""
    genre_features = {}
    feature_names = vectorizer.get_feature_names_out()
    
    for genre in df['Genre'].unique():
        # Filter text by genre
        genre_mask = df['Genre'] == genre
        genre_tfidf = vectorizer.transform(df[genre_mask]['Text'])
        
        # Calculate average TF-IDF scores for this genre
        avg_tfidf = genre_tfidf.mean(axis=0).A1
        
        # Get top features
        top_indices = avg_tfidf.argsort()[-n_top_features:][::-1]
        top_features = [(feature_names[i], avg_tfidf[i]) for i in top_indices]
        genre_features[genre] = top_features
    
    return genre_features

# Prepare the data
X = df['Text']  # All text data
y = df['Genre']  # All genre labels

# Create train/test split
le = LabelEncoder()
y_encoded = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded,
    test_size=0.2,
    random_state=12,
    stratify=y_encoded
)

# Create pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        analyzer='word',
        input='content',
        lowercase=True,
        stop_words=stop_words,
        tokenizer=WordLemmaTokenizer()
    )),

    ('clf', ComplementNB())
])


# Define parameter grid
param_grid = {
    'tfidf__min_df': [2, 3, 5],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__max_features': [None, 5000, 10000],
    'tfidf__norm': ['l1', 'l2'],
    'clf__alpha': [0.1, 0.5, 1.0, 2.0],
    'clf__norm': [True, False]
}

# Perform grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)

# Fit the model
grid_search.fit(X_train, y_train)

# Print best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Make predictions
y_pred = grid_search.predict(X_test)

# Print classification report
print("\nClassification Report:")
print(classification_report(
    y_test,
    y_pred,
    target_names=le.classes_
))

# If you want to analyze genre-specific features using the best vectorizer
best_vectorizer = grid_search.best_estimator_.named_steps['tfidf']
X_transformed = best_vectorizer.transform(X)
genre_features = analyze_genre_specific_features(df, best_vectorizer)

# Print top features for each genre
print("\nTop features by genre:")
for genre, features in genre_features.items():
    print(f"\n{genre}:")
    for feature, score in features:
        print(f"  - {feature}: {score:.4f}")

# Optional: Save best model
import joblib
joblib.dump(grid_search.best_estimator_, 'best_genre_classifier.pkl')

SyntaxError: invalid syntax (3400582306.py, line 56)

In [None]:
# CUSTOM PIPELINE TO INCLUDE RUNNING TFIDF ON EACH INDIVIDUAL GENRE - FEATURE ENGINEERING

from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np

class GenreSpecificVectorizer(BaseEstimator, TransformerMixin):
    """Custom transformer that fits separate TF-IDF vectorizers for each genre"""
    
    def __init__(self, vectorizer_params=None):
        self.vectorizer_params = vectorizer_params or {}
        self.vectorizers = {}
        self.feature_names = None
        
    def fit(self, X, y):
        # Create a vectorizer for each genre
        unique_genres = np.unique(y)
        all_features = set()
        
        # First pass: fit vectorizers and collect all feature names
        for genre in unique_genres:
            genre_mask = y == genre
            genre_texts = X[genre_mask]
            
            vectorizer = TfidfVectorizer(**self.vectorizer_params)
            vectorizer.fit(genre_texts)
            self.vectorizers[genre] = vectorizer
            all_features.update(vectorizer.get_feature_names_out())
        
        # Convert to sorted list for consistent feature ordering
        self.feature_names = sorted(list(all_features))
        return self
    
    def transform(self, X, y=None):
        # Look up each feature's column once instead of calling list.index()
        # (a linear scan) for every feature of every genre
        feature_index = {name: idx for idx, name in enumerate(self.feature_names)}
        features_matrix = np.zeros((len(X), len(self.feature_names)))

        if y is None:
            # During prediction, transform with all vectorizers and sum their scores
            for genre, vectorizer in self.vectorizers.items():
                # Column positions of this genre's features in the shared matrix
                cols = [feature_index[f] for f in vectorizer.get_feature_names_out()]
                features_matrix[:, cols] += vectorizer.transform(X).toarray()
            return features_matrix

        # During training, transform each text with its own genre's vectorizer
        for genre, vectorizer in self.vectorizers.items():
            genre_mask = y == genre
            if not np.any(genre_mask):
                continue

            cols = [feature_index[f] for f in vectorizer.get_feature_names_out()]
            genre_tfidf = vectorizer.transform(X[genre_mask]).toarray()
            # Write this genre's scores into its own rows and columns
            features_matrix[np.ix_(genre_mask, cols)] = genre_tfidf

        return features_matrix

# Prepare the data
X = df['Text'].values  # Convert to numpy array
y = df['Genre'].values

# Create train/test split
le = LabelEncoder()
y_encoded = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded,
    test_size=0.2,
    random_state=12,
    stratify=y_encoded
)

# Create parameter grid for vectorizer
vectorizer_param_grid = {
    'vectorizer_params': [
        {
            'analyzer': 'word',
            'input': 'content',
            'lowercase': True,
            'stop_words': stop_words,
            'min_df': min_df,
            'ngram_range': ngram_range,
            'tokenizer': WordLemmaTokenizer()
        }
        for min_df in [2, 3, 5]
        for ngram_range in [(1, 1), (1, 2), (1, 3)]
    ]
}

# Create parameter grid for classifier
clf_param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'norm': [True, False]
}

# Function to perform grid search
def perform_grid_search(X_train, y_train, X_test, y_test):
    best_score = 0
    best_params = None
    best_vectorizer = None
    best_clf = None
    
    for vectorizer_params in vectorizer_param_grid['vectorizer_params']:
        # Create and fit genre-specific vectorizer on the training data only
        vectorizer = GenreSpecificVectorizer(vectorizer_params)
        X_train_transformed = vectorizer.fit_transform(X_train, y_train)
        
        # Grid search for classifier
        clf_grid = GridSearchCV(
            ComplementNB(),
            clf_param_grid,
            cv=5,
            scoring='f1_weighted',
            n_jobs=-1
        )
        
        clf_grid.fit(X_train_transformed, y_train)
        
        # Compare candidates on the cross-validated score, not the test set,
        # so the held-out data stays untouched until the final evaluation
        score = clf_grid.best_score_
        if score > best_score:
            best_score = score
            best_params = {
                'vectorizer': vectorizer_params,
                'classifier': clf_grid.best_params_
            }
            best_vectorizer = vectorizer
            best_clf = clf_grid.best_estimator_
    
    return best_vectorizer, best_clf, best_params, best_score

# Perform grid search
print("Starting grid search...")
best_vectorizer, best_clf, best_params, best_score = perform_grid_search(
    X_train, y_train, X_test, y_test
)

# Print results
print("\nBest Parameters:")
print("Vectorizer parameters:", best_params['vectorizer'])
print("Classifier parameters:", best_params['classifier'])
print("\nBest score:", best_score)

# Make predictions with best model
X_test_transformed = best_vectorizer.transform(X_test)
y_pred = best_clf.predict(X_test_transformed)

# Print classification report
print("\nClassification Report:")
print(classification_report(
    y_test,
    y_pred,
    target_names=le.classes_
))

# Optional: Save best model
import joblib
model_dict = {
    'vectorizer': best_vectorizer,
    'classifier': best_clf,
    'label_encoder': le
}
joblib.dump(model_dict, 'best_genre_specific_classifier.pkl')
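
The column-alignment step inside `transform` above can be illustrated on its own. The sketch below is a minimal, pure-Python version of that idea: it merges per-genre vocabularies into one sorted feature space and scatters each genre's scores into the shared columns. The genre names, terms and scores are made up for illustration, and the dictionary lookup is an assumption about how the mapping could be done efficiently, not a description of the class above.

```python
# Hypothetical per-genre vocabularies and scores for ONE review
# (term names and values are made up for illustration).
genre_vocab = {
    "jazz": ["improvisation", "saxophone", "swing"],
    "rock": ["guitar", "riff", "swing"],
}
genre_scores = {
    "jazz": [0.5, 0.7, 0.2],
    "rock": [0.6, 0.4, 0.1],
}

# Shared, sorted feature space: the union of every genre's vocabulary.
feature_names = sorted(set().union(*genre_vocab.values()))

# Build the name -> column lookup once; a dict lookup is O(1), whereas
# list.index() rescans the list on every call.
feature_index = {name: col for col, name in enumerate(feature_names)}

# Scatter each genre's scores into the shared row, summing overlaps.
row = [0.0] * len(feature_names)
for genre, vocab in genre_vocab.items():
    for term, score in zip(vocab, genre_scores[genre]):
        row[feature_index[term]] += score

print(feature_names)
print(row)
```

Terms that appear in more than one genre vocabulary (here `swing`) receive the sum of their per-genre scores, which mirrors the `+=` used in the prediction branch.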

In [182]:
X = combined_tfidf_df.drop(columns=['Genre'])  # Feature columns (TF-IDF scores)
y = combined_tfidf_df['Genre']                 # Target column (genre)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
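
The split above does not pass `stratify`, unlike the other splits in this notebook, so rarer genres may be under-represented in the test set. As a minimal sketch of what stratification does, the example below uses a hypothetical, deliberately imbalanced label list (80 "rock", 20 "jazz") rather than the real `combined_tfidf_df`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical, deliberately imbalanced labels: 80 "rock" and 20 "jazz".
labels = ["rock"] * 80 + ["jazz"] * 20
samples = list(range(len(labels)))  # stand-ins for feature rows

# stratify=labels asks for the same genre proportions in both splits.
_, _, _, y_test_strat = train_test_split(
    samples, labels, test_size=0.2, random_state=12, stratify=labels
)

test_counts = Counter(y_test_strat)
print(test_counts)
```

With stratification the 20-sample test set keeps the 80/20 genre ratio; without it, the ratio depends on the random draw.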

In [183]:
#from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report, confusion_matrix

# Prepare the data
X = df['Text']  # All text data
y = df['Genre']  # All genre labels

# Create train/test split
le = LabelEncoder()
y_encoded = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded,
    test_size=0.2,
    random_state=12,
    stratify=y_encoded
)


# Create a pipeline that combines vectorization and classification
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        analyzer='word',
        input='content',
        lowercase=True,
        stop_words=stop_words,
        tokenizer=WordLemmaTokenizer()
    )),
    ('clf', ComplementNB())
])

# Define parameter grid for both vectorizer and classifier
param_grid = {
    'tfidf__min_df': [2, 3, 5],
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'tfidf__max_features': [None, 5000, 10000],
    'tfidf__norm': ['l1', 'l2'],
    'clf__alpha': [0.1, 0.5, 1.0, 2.0],
    'clf__norm': [True, False]
}

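# Create RandomizedSearchCV object (samples n_iter=10 candidates by default)
grid_search = RandomizedSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1,
    random_state=12  # make the sampled candidates reproducible
)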

# Fit the randomized search
grid_search.fit(X_train, y_train)

# Print best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Make predictions with best model
y_pred = grid_search.predict(X_test)

# Print classification report
print("\nClassification Report:")
print(classification_report(
    y_test,
    y_pred,
    target_names=le.classes_
))

# Optional: Save best model
#import joblib
#joblib.dump(grid_search.best_estimator_, 'best_genre_classifier.pkl')

Fitting 5 folds for each of 10 candidates, totalling 50 fits

50 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/simoncrouch/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/simoncrouch/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/Users/simoncrouch/opt/anaconda3/lib/python3.9/site-packages/sklearn/naive_bayes.py", line 663, in fit
    X, y = self._check_X_y(X, y)
  File "/Users/simoncrouch/opt/anaconda3/lib/python3.9/site-packages/sklearn/naive_

ValueError: Found input variables with inconsistent numbers of samples: [7486, 636]
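
One possible reading of the error above, offered as an assumption because the traceback is truncated: if the pipeline was fitted while `X` was still a DataFrame of precomputed features (such as the one built from `combined_tfidf_df`) rather than a Series of raw review text, `TfidfVectorizer` would iterate over the DataFrame's column labels and treat each one as a document. The number of rows it produces would then equal the number of columns, not the number of reviews. The sketch below reproduces that behaviour with a small, hypothetical table.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for a table of precomputed TF-IDF scores:
# 4 reviews (rows) described by 3 term columns.
features = pd.DataFrame(
    {
        "guitar": [0.1, 0.0, 0.3, 0.2],
        "synth": [0.0, 0.5, 0.1, 0.0],
        "beat": [0.2, 0.4, 0.0, 0.1],
    }
)
labels = ["rock", "electronic", "rock", "rap"]

# Iterating a DataFrame yields its column labels, not its rows, so the
# vectorizer treats each column name as one "document".
vectorized = TfidfVectorizer().fit_transform(features)

n_docs_seen = vectorized.shape[0]
# Rows seen by the vectorizer vs. number of labels
print(n_docs_seen, len(labels))
```

If this is the cause, passing the raw text column (for example `df['Text']`) to the pipeline, as the cell's code does, should give the vectorizer one document per review.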