### Natrual Language Processing Project:<br>An exploration into Ptichfork Music Reviews

Blake Spencer<br>
March 2019

The goal of this project is to understand how music reivews are written, and see if there are differences between genres or how well the review is written.

You can see my blog post about the project here:<br>
https://blake-spencer-projects.herokuapp.com/nlp

The main steps were: <br>

1. [Scrape all 21000 reviews and save them in a CSV](https://github.com/blakespencer/nlp-pitchfork-reviews/blob/master/pitchfork_scrape.ipynb)
2. [Clean the text] (https://github.com/blakespencer/nlp-pitchfork-reviews/blob/master/cleaning_data.ipynb)
3. **Topic modeling by sentence** (this file)
4. [Visualize the Data](https://blake-spencer-projects.herokuapp.com/nlp)

Each of the links above is a Jupyter Notebook file with Python code to complete each step.

The Flask App backend:

- [Flask app code in Python](https://github.com/blakespencer/personal-site-backend)

The React App frontend:

- [React app code in Javascript](https://github.com/blakespencer/personal-site-frontend)

In [52]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.decomposition import NMF
from nltk.tokenize import sent_tokenize
import re
import string
import numpy as np

In [6]:
with open('df_clean.pkl', 'rb') as picklefile:
    df = pickle.load(picklefile)

In [7]:
df.head()

Unnamed: 0,album_score,album_year,artist,album_name,text,genres,review_date,highly_rated,text_clean
0,9.0,1998,Tortoise,TNT,Imagine a graphic showing all the bands the fi...,"Experimental, Rock",February 17 2019,1,imagine a graphic showing all the bands the fi...
1,8.1,2019,Various Artists,Powder in Space,"In the 19th century, as unions throughout Amer...",No Genre,February 16 2019,1,in the century as unions throughout america...
2,5.6,2019,Perfect Son,Cast,Tobiasz Biliński’s Perfect Son is a descendant...,Rock,February 16 2019,0,tobiasz biliński perfectson is a descendant of...
3,7.3,2019,Black Taffy,Elder Mantis,Imagine if the Caretaker were more into RZA th...,Experimental,February 16 2019,1,imagine if the caretaker were more into rza th...
4,7.0,2019,Ithaca,The Language of Injury,Ithaca’s debut is an invitation for whiplash. ...,Metal,February 16 2019,1,ithaca debut is an invitation for whiplash th...


# Topic Modeling by Sentence 
- Topic Modeling by document ended up giving me genre's as topics. I want to see if it's possible to generate topics based on how reviews are written, not by topic

In [11]:
df['id'] = df['artist'] + df['album_name'] + df['review_date']

In [12]:
df_sentence = df[['id', 'text']]

In [13]:
def create_sentence_split(row, data_sentence=[]):
    text_sentences = text.split('.')
    for sentence in text_sentences:
        data_sentence.append([row.id, sentence])

In [14]:
text_sentences = []

In [15]:
ids = list(df['id'])
texts = list(df['text'])

In [18]:
pd.DataFrame(text_sentences)

In [22]:
artists = list(df['artist'].unique())
artists.append('Joey Bada')
artists.append('Smashing Pumpkins')
artists.append('Jimi Hendrix')
artists.append('Jane s Addiction')
artists.append('Bob Marley')

In [23]:
def replace_artist_name_v2(text, artists=artists):
    text = text.replace('\n', '')
    for artist in artists:
        if artist in text:
            if(artist == 'The Velvet Underground'):
                artist_name = 'TheVelvetUnderground'
                text = text.replace('Velvet Underground' + ' ', artist_name + " ")
                text = text.replace('Velvet Underground' + '.', artist_name + ".")
                text = text.replace('Velvet Underground' + '!', artist_name + "!")
                text = text.replace('Velvet Underground' + '?', artist_name + "?")
                text = text.replace('Velvets' + ' ', artist_name + ' ')
                text = text.replace('Velvets' + '.', artist_name + ".")
                text = text.replace('Velvets' + '!', artist_name + "!")
                text = text.replace('Velvets' + '?', artist_name + "?")
                
            artist_name = artist.replace(' ', '')
            words = artist.split()
            text = text.replace(artist, artist_name)
            if(len(words) == 2):
                text = text.replace(words[1] + ' ', artist_name + " ")
                text = text.replace(words[1] + '.', artist_name + ".")
                text = text.replace(words[1] + '!', artist_name + "!")
                text = text.replace(words[1] + '?', artist_name + "?")
                if(words[0] != 'The' or 'the'):
                    text = text.replace(words[0] + ' ', artist_name + ' ')
                    text = text.replace(words[0] + '.', artist_name + ".")
                    text = text.replace(words[0] + '!', artist_name + "!")
                    text = text.replace(words[0] + '?', artist_name + "?")
            text = text.replace(words[0]+artist_name, artist_name)
    text = text.replace('-', '')
    return text

In [24]:
df['text_clean_v2'] = df.text.map(replace_artist_name_v2)

Now splitting sentences while keeping the id

In [26]:
def split_sentence(df):
    ids = list(df['id'])
    texts = list(df['text_clean_v2'])
    text_sentences = []
    for idx, review_id in enumerate(ids):
        sentences = sent_tokenize(texts[idx])
        for sentence in sentences:
            text_sentences.append([review_id, sentence])
    return text_sentences

In [30]:
sentences_from_text = split_sentence(df)

In [31]:
df_sentences = pd.DataFrame(sentences_from_text, columns=['review_id', 'sentences'])

In [32]:
df_sentences.head()

Unnamed: 0,review_id,sentences
0,TortoiseTNTFebruary 17 2019,Imagine a graphic showing all the bands the fi...
1,TortoiseTNTFebruary 17 2019,At the top of the funnel you have groups rangi...
2,TortoiseTNTFebruary 17 2019,"In this graphic, Tortoise is the choke point, ..."
3,TortoiseTNTFebruary 17 2019,No album in the band’s initial run embodied th...
4,TortoiseTNTFebruary 17 2019,"Weirdly beautiful and impossible to pin down, ..."


In [38]:
alphanumeric = lambda x: re.sub('\w*\d\w*', ' ', x)
punc = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
lower = lambda x: x.lower()
no_apostrophy_s = lambda x: re.sub("’s",'', x)

In [42]:
df_sentences['clean_sentence'] = df_sentences['sentences'].map(alphanumeric).map(punc).map(no_apostrophy_s).map(lower)

In [43]:
X_sentence = df_sentences['sentences']
y_sentence = df_sentences['review_id']

Creating Custom Stop Words that don't give topics meaning

In [46]:
stop_words = ENGLISH_STOP_WORDS.union(['ve',
                                       'music',
                                       'sound',
                                       'fucking',
                                       'fuck',
                                       'pitchfork',
                                       'white',
                                       'en',
                                       'com',
                                       'media',
                                       'backend',
                                       'js',
                                       'tiny_mce',
                                       'themes',
                                       'advanced',
                                       'langs',
                                       'http',
                                       'script',
                                       'chris',
                                       'cornell',
                                       'john',
                                       'velvet',
                                       'underground',
                                       'did'
                                      ])

Try with different amount of TOPICs and MAX_DFs to see what comes out. Add to Stop Words when appropriate... 
- I settled with 4 topics and a max df of 0.007 since it is done by sentence, max df is very low. If it was bigger, then genre topics would start to form

In [49]:
cv = CountVectorizer(stop_words=stop_words, max_df=0.007)
X = df_sentences['clean_sentence']

X_cv = cv.fit_transform(X)


pd.DataFrame(X_cv.toarray(), columns=cv.get_feature_names()).head(10)

n_topics = 4

model = NMF(n_components=n_topics)
doc_topic_FOUR_pointseven_percent = model.fit_transform(X_cv)

print(model.components_.shape)


top_ten_words = model.components_.argsort(axis=1)[:,-10:]


for i, row in enumerate(top_ten_words):
    print("topic:",i)
    for x in row:
        print("\t",cv.get_feature_names()[x].strip())

(4, 211279)
topic: 0
	 exactly
	 having
	 doing
	 history
	 radio
	 stuff
	 girl
	 fun
	 country
	 story
topic: 1
	 rhythms
	 heavy
	 ambient
	 familiar
	 chords
	 sonic
	 melodic
	 tone
	 slow
	 synths
topic: 2
	 section
	 member
	 musicians
	 plays
	 trio
	 members
	 vocalist
	 bassist
	 guitarist
	 drummer
topic: 3
	 compilation
	 couple
	 recordings
	 fulllength
	 ago
	 releases
	 previous
	 series
	 singles
	 decade


Distrobution of topics

In [54]:
FOUR_pointseven = np.argmax(doc_topic_FOUR_pointseven_percent, axis=1)
    

x = FOUR_pointseven

y = np.bincount(x)
ii = np.nonzero(y)[0]
np.vstack((ii,y[ii])).T

array([[     0, 246627],
       [     1, 166422],
       [     2,  11821],
       [     3,  36663]])

Now look at the actual sentences to see if topics make sense

In [58]:
def get_topic(num, x):
    output = []
    for i in list(enumerate(x[:200])):
        if(num == i[1]):
            output.append(i)
    return output

In [59]:
get_topic(1, FOUR_pointseven)

[(2, 1),
 (4, 1),
 (6, 1),
 (13, 1),
 (17, 1),
 (18, 1),
 (20, 1),
 (24, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 1),
 (35, 1),
 (37, 1),
 (40, 1),
 (42, 1),
 (44, 1),
 (45, 1),
 (47, 1),
 (52, 1),
 (53, 1),
 (56, 1),
 (57, 1),
 (58, 1),
 (59, 1),
 (60, 1),
 (61, 1),
 (62, 1),
 (65, 1),
 (66, 1),
 (75, 1),
 (77, 1),
 (79, 1),
 (80, 1),
 (81, 1),
 (83, 1),
 (84, 1),
 (86, 1),
 (87, 1),
 (91, 1),
 (92, 1),
 (95, 1),
 (96, 1),
 (98, 1),
 (99, 1),
 (100, 1),
 (101, 1),
 (102, 1),
 (103, 1),
 (105, 1),
 (106, 1),
 (107, 1),
 (108, 1),
 (109, 1),
 (110, 1),
 (111, 1),
 (112, 1),
 (113, 1),
 (114, 1),
 (115, 1),
 (117, 1),
 (121, 1),
 (123, 1),
 (125, 1),
 (127, 1),
 (128, 1),
 (130, 1),
 (131, 1),
 (133, 1),
 (141, 1),
 (142, 1),
 (143, 1),
 (144, 1),
 (145, 1),
 (151, 1),
 (153, 1),
 (156, 1),
 (158, 1),
 (160, 1),
 (164, 1),
 (165, 1),
 (166, 1),
 (167, 1),
 (174, 1),
 (177, 1),
 (178, 1),
 (179, 1),
 (186, 1),
 (187, 1),
 (189, 1),
 (199, 1)]

Now attach distrobution of topics to actual review

In [60]:
df_sentences['topic'] = FOUR_pointseven

In [61]:
df_sentences.head()

Unnamed: 0,review_id,sentences,clean_sentence,topic
0,TortoiseTNTFebruary 17 2019,Imagine a graphic showing all the bands the fi...,imagine a graphic showing all the bands the fi...,0
1,TortoiseTNTFebruary 17 2019,At the top of the funnel you have groups rangi...,at the top of the funnel you have groups rangi...,3
2,TortoiseTNTFebruary 17 2019,"In this graphic, Tortoise is the choke point, ...",in this graphic tortoise is the choke point ...,1
3,TortoiseTNTFebruary 17 2019,No album in the band’s initial run embodied th...,no album in the band initial run embodied that...,0
4,TortoiseTNTFebruary 17 2019,"Weirdly beautiful and impossible to pin down, ...",weirdly beautiful and impossible to pin down ...,1


In [64]:
df_sentences = pd.get_dummies(df_sentences, columns=['topic'])

In [65]:
df_sentences.head()

Unnamed: 0,review_id,sentences,clean_sentence,topic_0,topic_1,topic_2,topic_3
0,TortoiseTNTFebruary 17 2019,Imagine a graphic showing all the bands the fi...,imagine a graphic showing all the bands the fi...,1,0,0,0
1,TortoiseTNTFebruary 17 2019,At the top of the funnel you have groups rangi...,at the top of the funnel you have groups rangi...,0,0,0,1
2,TortoiseTNTFebruary 17 2019,"In this graphic, Tortoise is the choke point, ...",in this graphic tortoise is the choke point ...,0,1,0,0
3,TortoiseTNTFebruary 17 2019,No album in the band’s initial run embodied th...,no album in the band initial run embodied that...,1,0,0,0
4,TortoiseTNTFebruary 17 2019,"Weirdly beautiful and impossible to pin down, ...",weirdly beautiful and impossible to pin down ...,0,1,0,0


In [66]:
df_topic_count = df_sentences.groupby(['review_id']).agg({'topic_0': 'sum', 'topic_1': 'sum', 'topic_2': 'sum', 'topic_3': 'sum'})


In [67]:
df['review_id'] = df['id']

In [68]:
df_merged = df.merge(df_topic_count, on='review_id')

In [71]:
df_merged['total'] = df_merged['topic_0'] + df_merged['topic_1'] + df_merged['topic_2'] + df_merged['topic_3']

In [72]:
df_merged['topic_0'] = df_merged['topic_0'] / df_merged['total']

df_merged['topic_1'] = df_merged['topic_1'] / df_merged['total']

df_merged['topic_2'] = df_merged['topic_2'] / df_merged['total']

df_merged['topic_3'] = df_merged['topic_3'] / df_merged['total']

In [73]:
df_merged['year'] = df_merged['review_date'].transform(lambda i : i[-4:])

In [74]:
columns = list(df_merged.columns)

In [75]:
columns[-3] = 'History/Context'
columns[-4] = 'Artist'
columns[-5] = 'Opinion/Interpretation'
columns[-6] = 'Sound'

In [77]:
df_merged.columns = columns

In [78]:
df_merged

Unnamed: 0,album_score,album_year,artist,album_name,text,genres,review_date,highly_rated,text_clean,id,text_clean_v2,review_id,Sound,Opinion/Interpretation,Artist,History/Context,total,year
0,9.0,1998,Tortoise,TNT,Imagine a graphic showing all the bands the fi...,"Experimental, Rock",February 17 2019,1,imagine a graphic showing all the bands the fi...,TortoiseTNTFebruary 17 2019,Imagine a graphic showing all the bands the fi...,TortoiseTNTFebruary 17 2019,0.308824,0.485294,0.058824,0.147059,68,2019
1,8.1,2019,Various Artists,Powder in Space,"In the 19th century, as unions throughout Amer...",No Genre,February 16 2019,1,in the century as unions throughout america...,Various ArtistsPowder in SpaceFebruary 16 2019,"In the 19th century, as unions throughout Amer...",Various ArtistsPowder in SpaceFebruary 16 2019,0.421053,0.421053,0.000000,0.157895,19,2019
2,5.6,2019,Perfect Son,Cast,Tobiasz Biliński’s Perfect Son is a descendant...,Rock,February 16 2019,0,tobiasz biliński perfectson is a descendant of...,Perfect SonCastFebruary 16 2019,Tobiasz Biliński’s PerfectSon is a descendant ...,Perfect SonCastFebruary 16 2019,0.545455,0.454545,0.000000,0.000000,11,2019
3,7.3,2019,Black Taffy,Elder Mantis,Imagine if the Caretaker were more into RZA th...,Experimental,February 16 2019,1,imagine if the caretaker were more into rza th...,Black TaffyElder MantisFebruary 16 2019,Imagine if the Caretaker were more into RZA th...,Black TaffyElder MantisFebruary 16 2019,0.000000,0.937500,0.000000,0.062500,16,2019
4,7.0,2019,Ithaca,The Language of Injury,Ithaca’s debut is an invitation for whiplash. ...,Metal,February 16 2019,1,ithaca debut is an invitation for whiplash th...,IthacaThe Language of InjuryFebruary 16 2019,Ithaca’s debut is an invitation for whiplash. ...,IthacaThe Language of InjuryFebruary 16 2019,0.428571,0.523810,0.047619,0.000000,21,2019
5,7.7,2019,Ladytron,Ladytron,"Two Decembers ago, news viewers were gripped b...",Pop/R&B,February 15 2019,1,two decembers ago news viewers were gripped b...,LadytronLadytronFebruary 15 2019,"Two Decembers ago, news viewers were gripped b...",LadytronLadytronFebruary 15 2019,0.550000,0.350000,0.000000,0.100000,20,2019
6,7.2,2019,Broken Social Scene,Let’s Try the After Vol. 1 EP,Given Broken Social Scene’s current status as ...,Rock,February 15 2019,1,given brokensocialscene current status as the ...,Broken Social SceneLet’s Try the After Vol. 1 ...,Given BrokenSocialScene’s current status as th...,Broken Social SceneLet’s Try the After Vol. 1 ...,0.466667,0.466667,0.000000,0.066667,15,2019
7,8.0,2019,Rina Mushonga,In a Galaxy,"Less isn’t always more. In 2014, the Dutch-Zim...",Pop/R&B,February 15 2019,1,less isn’t always more in the dutch zimbab...,Rina MushongaIn a GalaxyFebruary 15 2019,"Less isn’t always more. In 2014, the DutchZimb...",Rina MushongaIn a GalaxyFebruary 15 2019,0.590909,0.318182,0.045455,0.045455,22,2019
8,7.8,2019,King Midas Sound,Solitude,Kevin Martin’s music has always pursued extrem...,Experimental,February 15 2019,1,kevin martin music has always pursued extremes...,King Midas SoundSolitudeFebruary 15 2019,Kevin Martin’s music has always pursued extrem...,King Midas SoundSolitudeFebruary 15 2019,0.470588,0.352941,0.088235,0.088235,34,2019
9,7.5,2019,Wadada Leo Smith,Rosa Parks: Pure Love. An Oratorio of Seven Songs,Trumpeter Wadada Leo Smith has created a bound...,Jazz,February 15 2019,1,trumpeter wadadaleosmith has created a boundar...,Wadada Leo SmithRosa Parks: Pure Love. An Orat...,Trumpeter WadadaLeoSmith has created a boundar...,Wadada Leo SmithRosa Parks: Pure Love. An Orat...,0.363636,0.636364,0.000000,0.000000,11,2019


In [79]:
with open('df_topics.pkl', 'wb') as picklefile:
    pickle.dump(df_merged, picklefile)