### Natural Language Processing Project:<br>An exploration into Pitchfork Music Reviews

Blake Spencer<br>
March 2019

The goal of this project is to understand how music reviews are written, and to see whether the writing differs between genres or with the album's score.

You can see my blog post about the project here:<br>
https://blake-spencer-projects.herokuapp.com/nlp

The main steps were: <br>

1. [Scrape all 21000 reviews and save them in a CSV](https://github.com/blakespencer/nlp-pitchfork-reviews/blob/master/pitchfork_scrape.ipynb)
2. **Clean the text** (this file)
3. [Topic modeling by sentence](https://github.com/blakespencer/nlp-pitchfork-reviews/blob/master/topic_modeling.ipynb)
4. [Visualize the Data](https://blake-spencer-projects.herokuapp.com/nlp)

Each of the links above is a Jupyter Notebook file with Python code to complete each step.

The Flask App backend:

- [Flask app code in Python](https://github.com/blakespencer/personal-site-backend)

The React App frontend:

- [React app code in JavaScript](https://github.com/blakespencer/personal-site-frontend)


In [34]:
import pandas as pd
import numpy as np
import pickle
import nltk
import re
import string

Load all the scraped CSV chunks and combine them into one DataFrame

In [2]:
df = pd.read_csv('./data/1_360.csv')

In [3]:
data_paths = ['361_720.csv', '721_1080.csv', '1081_1440.csv', '1441_1749.csv']

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
for path in data_paths:
    df = pd.concat([df, pd.read_csv('./data/{}'.format(path))])

In [5]:
len(df)

20986

In [6]:
df.head()

Unnamed: 0,ablum_score,album_year,artist,album_name,text,genres,review_date
0,9.0,1998,Tortoise,TNT,Imagine a graphic showing all the bands the fi...,"Experimental, Rock",February 17 2019
1,8.1,2019,Various Artists,Powder in Space,"In the 19th century, as unions throughout Amer...",,February 16 2019
2,5.6,2019,Perfect Son,Cast,Tobiasz Biliński’s Perfect Son is a descendant...,Rock,February 16 2019
3,7.3,2019,Black Taffy,Elder Mantis,Imagine if the Caretaker were more into RZA th...,Experimental,February 16 2019
4,7.0,2019,Ithaca,The Language of Injury,Ithaca’s debut is an invitation for whiplash. ...,Metal,February 16 2019


In [7]:
columns = list(df.columns)

In [8]:
# The scraped CSV header misspells the first column as 'ablum_score'
columns[0] = 'album_score'

In [9]:
df.columns = columns

In [11]:
df.head()

Unnamed: 0,album_score,album_year,artist,album_name,text,genres,review_date
0,9.0,1998,Tortoise,TNT,Imagine a graphic showing all the bands the fi...,"Experimental, Rock",February 17 2019
1,8.1,2019,Various Artists,Powder in Space,"In the 19th century, as unions throughout Amer...",,February 16 2019
2,5.6,2019,Perfect Son,Cast,Tobiasz Biliński’s Perfect Son is a descendant...,Rock,February 16 2019
3,7.3,2019,Black Taffy,Elder Mantis,Imagine if the Caretaker were more into RZA th...,Experimental,February 16 2019
4,7.0,2019,Ithaca,The Language of Injury,Ithaca’s debut is an invitation for whiplash. ...,Metal,February 16 2019


Create a highly_rated column

In [12]:
highly_rated = df['album_score'] >= 7

In [13]:
def is_highly_rated(row):
    # Map the boolean comparison result to 1 (score >= 7) or 0
    return 1 if row else 0

In [14]:
df['highly_rated'] = highly_rated.transform(is_highly_rated)
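The same column can also be built without a helper function. The sketch below uses a handful of made-up scores (not the scraped data) and casts the boolean comparison directly to integers; the `scores` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Made-up scores for illustration only; the real notebook uses df['album_score'].
scores = pd.DataFrame({'album_score': [9.0, 8.1, 5.6, 7.3, 7.0]})

# The comparison yields True/False, and astype(int) turns that into 1/0.
scores['highly_rated'] = (scores['album_score'] >= 7).astype(int)

print(scores['highly_rated'].tolist())
```

This keeps the threshold of 7 from the cell above and avoids a Python-level call per row.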

In [15]:
count_nan = len(df) - df.count()

In [16]:
count_nan

album_score        0
album_year         0
artist             1
album_name         3
text               6
genres          2302
review_date        0
highly_rated       0
dtype: int64
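For reference, `len(df) - df.count()` gives the same per-column totals as `isna().sum()`. A minimal sketch on a toy frame (the values are invented, not taken from the reviews):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values; not the scraped data.
toy = pd.DataFrame({
    'artist': ['Tortoise', None, 'Ithaca'],
    'genres': ['Rock', np.nan, np.nan],
})

# count() ignores missing values, so the difference is the number of NaNs.
by_count = len(toy) - toy.count()

# isna() flags missing values and sum() totals them per column.
by_isna = toy.isna().sum()

print(by_count.tolist(), by_isna.tolist())
```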

In [17]:
df = df.dropna(subset=['album_name', 'text', 'artist'])

Replace missing genres (NaN) with a 'No Genre' label

In [22]:
def is_nan(row):
    # Genre strings are never NaN; pd.isna handles float NaN and None
    if isinstance(row, str):
        return False
    return pd.isna(row)

In [23]:
def genre_nan_replace(row):
    if(is_nan(row)):
        return 'No Genre'
    return row

In [24]:
df['genres'] = df['genres'].transform(genre_nan_replace)
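pandas also has a built-in for this kind of replacement. The sketch below applies `fillna` to a toy column (invented values, with `toy` standing in for `df`) and should give the same result as the two helper functions above, assuming the only non-string values in the column are NaN.

```python
import numpy as np
import pandas as pd

# Toy genre column with one missing entry; not the scraped data.
toy = pd.DataFrame({'genres': ['Experimental, Rock', np.nan, 'Metal']})

# fillna replaces every missing value with the given label.
toy['genres'] = toy['genres'].fillna('No Genre')

print(toy['genres'].tolist())
```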

In [27]:
df.head()

Unnamed: 0,album_score,album_year,artist,album_name,text,genres,review_date,highly_rated
0,9.0,1998,Tortoise,TNT,Imagine a graphic showing all the bands the fi...,"Experimental, Rock",February 17 2019,1
1,8.1,2019,Various Artists,Powder in Space,"In the 19th century, as unions throughout Amer...",No Genre,February 16 2019,1
2,5.6,2019,Perfect Son,Cast,Tobiasz Biliński’s Perfect Son is a descendant...,Rock,February 16 2019,0
3,7.3,2019,Black Taffy,Elder Mantis,Imagine if the Caretaker were more into RZA th...,Experimental,February 16 2019,1
4,7.0,2019,Ithaca,The Language of Injury,Ithaca’s debut is an invitation for whiplash. ...,Metal,February 16 2019,1


In [28]:
df.highly_rated.value_counts(normalize=True)

1    0.622235
0    0.377765
Name: highly_rated, dtype: float64

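Create a custom artist list for name replacement. A few extra names are appended by hand; they are spelled the way they appear after punctuation has been replaced by spaces (for example 'Jane s Addiction').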

In [29]:
artists = list(df['artist'].unique())
artists.append('Joey Bada')
artists.append('Smashing Pumpkins')
artists.append('Jimi Hendrix')
artists.append('Jane s Addiction')
artists.append('Bob Marley')

For NLP, each artist's name needs to be collapsed into a single token, and references to the artist by last name need to map to that same token. <br>
So if an artist's full name appears in a review, every reference to the artist is replaced. Reviews sometimes refer to the artist by first name rather than last name, so first names are replaced too, unless the first word is "The".

In [31]:
def replace_artist_name(text, artists=artists):
    text = text.replace('\n', '')
    for artist in artists:
        if artist + ' ' in text:
            artist_name = artist.replace(' ', '')
            words = artist.split()
            text = text.replace(artist + ' ', artist_name + ' ')
            if(len(words) == 2):
                text = text.replace(words[1] + ' ', artist_name + " ")
                # The original check (words[0] != 'The' or 'the') was always true
                if words[0].lower() != 'the':
                    text = text.replace(words[0] + ' ', artist_name + ' ')
            text = text.replace(words[0]+artist_name, artist_name)
    return text
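To make the replacement rules concrete, here is a standalone sketch of the same logic run on two invented sentences. It takes the artist list as an explicit argument and writes the "The" check as a case-insensitive comparison, which is an assumption about the intended behavior; the sample sentences are made up.

```python
def replace_artist_name_demo(text, artists):
    """Collapse each known artist name into one token (sketch of the cell above)."""
    text = text.replace('\n', '')
    for artist in artists:
        if artist + ' ' in text:
            artist_name = artist.replace(' ', '')
            words = artist.split()
            # Full name -> single token
            text = text.replace(artist + ' ', artist_name + ' ')
            if len(words) == 2:
                # Last name on its own -> single token
                text = text.replace(words[1] + ' ', artist_name + ' ')
                # First name on its own -> single token, unless it is "The"
                if words[0].lower() != 'the':
                    text = text.replace(words[0] + ' ', artist_name + ' ')
            # Undo the doubled prefix created by the last-name replacement
            text = text.replace(words[0] + artist_name, artist_name)
    return text


hendrix = replace_artist_name_demo(
    'Jimi Hendrix changed the guitar and Hendrix played it upside down',
    ['Jimi Hendrix'],
)
caretaker = replace_artist_name_demo(
    'The Caretaker returns with The tapes',
    ['The Caretaker'],
)
print(hendrix)
print(caretaker)
```

In the first sentence both the full name and the bare last name end up as `JimiHendrix`. In the second, the guard leaves the unrelated "The" in "The tapes" alone.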

Create functions to drop any word containing a digit, replace punctuation with spaces, lowercase the text, and remove "’s" so that artist names are easier to clean

In [32]:
# Raw string avoids invalid-escape warnings; removes any token containing a digit
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
punc = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x)
lower = lambda x: x.lower()
no_apostrophy_s = lambda x: re.sub("’s",'', x)

In [35]:
df['text_clean'] = df.text.map(alphanumeric).map(punc).map(no_apostrophy_s).map(replace_artist_name).map(lower)
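As a rough illustration of what the non-artist steps do to a single string, the sketch below applies the same four substitutions, in the same order, to an invented sentence. The function name `clean_demo` and the sample text are hypothetical; the artist replacement step is left out to keep the example small.

```python
import re
import string


def clean_demo(text):
    # 1. Drop any token that contains a digit (e.g. "19th")
    text = re.sub(r'\w*\d\w*', ' ', text)
    # 2. Replace ASCII punctuation with spaces (the curly apostrophe is not ASCII)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # 3. Remove the possessive "’s" written with a curly apostrophe
    text = re.sub('’s', '', text)
    # 4. Lowercase everything
    return text.lower()


cleaned = clean_demo('In the 19th century, Ithaca’s debut arrived!')

# Extra spaces are left behind by the substitutions, so show the tokens instead.
print(cleaned.split())
```

The removed characters leave runs of spaces behind, which is why the example prints tokens rather than the raw string.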

In [36]:
df.head(2)

Unnamed: 0,album_score,album_year,artist,album_name,text,genres,review_date,highly_rated,text_clean
0,9.0,1998,Tortoise,TNT,Imagine a graphic showing all the bands the fi...,"Experimental, Rock",February 17 2019,1,imagine a graphic showing all the bands the fi...
1,8.1,2019,Various Artists,Powder in Space,"In the 19th century, as unions throughout Amer...",No Genre,February 16 2019,1,in the century as unions throughout america...


In [40]:
with open('df_clean.pkl', 'wb') as picklefile:
    pickle.dump(df, picklefile)
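The next notebook can read the cleaned data back with `pickle.load`. A minimal round-trip sketch, using a toy frame and a temporary file path (both made up) rather than the real `df_clean.pkl`:

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned reviews.
toy = pd.DataFrame({'artist': ['Tortoise'], 'text_clean': ['imagine a graphic']})

# Hypothetical path in a temporary directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'df_clean_demo.pkl')

with open(path, 'wb') as picklefile:
    pickle.dump(toy, picklefile)

with open(path, 'rb') as picklefile:
    restored = pickle.load(picklefile)

print(restored.equals(toy))
```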