### Natural Language Processing Project:<br>An exploration into Pitchfork Music Reviews

Blake Spencer<br>
March 2019

The goal of this project is to understand how music reviews are written, and to see whether there are differences between genres or by how well the review is written.

You can see my blog post about the project here:<br>
https://blake-spencer-projects.herokuapp.com/nlp

The main steps were: <br>

1. **Scrape all 21,000 reviews and save them in a CSV** (this file)
2. [Clean the text](https://github.com/blakespencer/nlp-pitchfork-reviews/blob/master/cleaning_data.ipynb)
3. [Topic Model](https://github.com)
4. [Visualize the Data](https://blake-spencer-projects.herokuapp.com/nlp)

Each of the links above is a Jupyter Notebook file with Python code to complete each step.

The Flask App backend:

- [Flask app code in Python](https://github.com/blakespencer/personal-site-backend)

The React App frontend:

- [React app code in JavaScript](https://github.com/blakespencer/personal-site-frontend)


This is an example of how to scrape the reviews. The Python script [here](https://github.com/blakespencer/nlp-pitchfork-reviews/blob/master/pitchfork_scrape.py) was run on AWS.

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy
import time
import csv

In [2]:
url = 'https://pitchfork.com/reviews/albums/?page={}'

In [3]:
def get_review_hrefs(index):
    output = []
    url_page = url.format(index)
    response = requests.get(url_page)
    # Retry until the page loads, pausing a random (exponentially distributed) time
    while response.status_code != 200:
        time.sleep(numpy.random.exponential(5))
        response = requests.get(url_page)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    reviews = soup.find('div', {'class':'fragment-list'}).find_all('div', {'class': 'review'})
    for review in reviews:
        href = review.find('a')
        output.append('https://pitchfork.com' + href.get('href'))
    return output
    

In [4]:
get_review_hrefs(1)

['https://pitchfork.com/reviews/albums/chaka-khan-hello-happiness/',
 'https://pitchfork.com/reviews/albums/sir-babygirl-crush-on-me/',
 'https://pitchfork.com/reviews/albums/efdemin-new-atlantis/',
 'https://pitchfork.com/reviews/albums/nate-young-volume-one-dilemmas-of-identity/',
 'https://pitchfork.com/reviews/albums/avril-lavigne-head-above-water/',
 'https://pitchfork.com/reviews/albums/elena-setien-another-kind-of-revolution/',
 'https://pitchfork.com/reviews/albums/cherushii-maria-minerva-cherushii-and-maria-minerva/',
 'https://pitchfork.com/reviews/albums/jozef-van-wissem-jim-jarmusch-an-attempt-to-draw-aside-the-veil/',
 'https://pitchfork.com/reviews/albums/tortoise-tnt/',
 'https://pitchfork.com/reviews/albums/various-artists-powder-in-space/',
 'https://pitchfork.com/reviews/albums/perfect-son-cast/',
 'https://pitchfork.com/reviews/albums/black-taffy-elder-mantis/']

In [5]:
numpy.random.exponential(5, 1)

array([4.00650832])
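
The retry loops in this notebook wait a random, exponentially distributed time between attempts, but they keep going forever if a page never returns status 200. Below is a minimal sketch of a bounded variant. It is not part of the original script: `fetch_with_retry`, `max_attempts`, `mean_delay`, and the injectable `get` and `sleep` parameters are hypothetical names introduced for illustration. It also swaps `numpy.random.exponential` for the standard library's `random.expovariate`, which returns a plain float rather than the one-element array shown above.

```python
import random
import time


def fetch_with_retry(url, get, max_attempts=5, mean_delay=5.0, sleep=time.sleep):
    """Call get(url) until it returns status 200 or the attempts run out.

    `get` is any callable that returns an object with a `status_code`
    attribute, for example `requests.get` (assumed, supplied by the caller).
    """
    for attempt in range(1, max_attempts + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        if attempt < max_attempts:
            # expovariate takes the rate (1 / mean) and returns a float
            sleep(random.expovariate(1.0 / mean_delay))
    raise RuntimeError('gave up on {} after {} attempts'.format(url, max_attempts))
```

With `requests` imported, a call might look like `fetch_with_retry(url.format(1), requests.get)`. Passing `get` and `sleep` as arguments keeps the helper testable without network access.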

In [6]:
def get_album_info(href):
    response = requests.get(href)
    # Retry the same review URL until it loads (the original retried an undefined `url_page`)
    while response.status_code != 200:
        time.sleep(numpy.random.exponential(5))
        response = requests.get(href)
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    review_details = soup.find('div', {'class': 'review-detail'})
    ablum_score = review_details.find('span', {'class': 'score'}).text
    album_year = review_details.find('span', {'class': 'single-album-tombstone__meta-year'}).text[-4:]
    artist = (review_details
              .find('ul', {'class': 'artist-links artist-list single-album-tombstone__artist-links'}).text)
    album_name = review_details.find('h1', {'class': 'single-album-tombstone__review-title'}).text
    try:
        text = "".join([i.text for i in (review_details
         .find('div', {'class': 'row review-body'})
         .find('div', {'class': 'review-detail__article-content'})
         .find('div', {'class': 'contents dropcap'})
         .find_all('p')
        )])
    except AttributeError:
        # A find() call returned None because an expected element is missing
        text = ''
    try: 
        genres = ([i.text for i in review_details
                   .find('div', {'class': 'article-meta article-meta--reviews'})
                   .find('ul', {'class': 'genre-list genre-list--before'})
                   .find_all('li')
                   ])
    except AttributeError:
        # No genre list on this review page
        genres = []
    review_date = (review_details
                   .find('div', {'class': 'article-meta article-meta--reviews'})
                   .find('time').text
                  )
    return {'ablum_score': ablum_score,
            'album_year': album_year,
            'artist': artist,
            'album_name': album_name,
            'text': text,
            'genres': ', '.join(genres),
            'review_date': review_date
           }
    
    

In [7]:
one_row = [get_album_info('https://pitchfork.com/reviews/albums/tortoise-tnt/')]

In [8]:
keys = one_row[0].keys()
with open('test.csv', 'w', newline='', encoding='utf-8') as output_file:
    dict_writer = csv.DictWriter(output_file,  fieldnames=keys)
    dict_writer.writeheader()
    dict_writer.writerows(one_row)

In [195]:
list(keys)

['ablum_score',
 'album_year',
 'artist',
 'album_name',
 'text',
 'genres',
 'review_date']

In [178]:
one_row

[{'ablum_score': '9.0',
  'album_year': '1998',
  'artist': 'Tortoise',
  'album_name': 'TNT',
  'text': 'Imagine a graphic showing all the bands the five members of Tortoise were in before they came together and then all the bands they went on to play with after. At the top of the funnel you have groups ranging from dreamy psych-rock to earthy post-punk crunch, including Eleventh Dream Day, Bastro, Slint, and the Poster Children; on the “post-Tortoise” end are groups focusing on electro-jazz and twangy instrumental rock like Isotope 217, Chicago Underground, and Brokeback. In this graphic, Tortoise is the choke point, the one project that has elements of all these sounds but is never defined by nor committed to any of them.Instead, Tortoise floats free, a planchette moving over a Ouija board guided by 10 sets of fingers, where everyone watches the arrow float in one direction but no one is quite sure how it gets there or who is doing the pushing. No album in the band’s initial run emb

In [None]:
def write_csv_reviews(lower_range, upper_range):
    keys = ([
             'ablum_score',
             'album_year',
             'artist',
             'album_name',
             'text',
             'genres',
             'review_date'
            ])
    rows = []
    count = 0
    for i in range(lower_range, upper_range + 1):
        sleep_seconds = numpy.random.exponential(5, 1)
        time.sleep(sleep_seconds)
        hrefs = get_review_hrefs(i)
        for link in hrefs:  
            count += 1
            print(count)
            print(link)
            rows.append(get_album_info(link))
    with open('{}_{}.csv'.format(lower_range, upper_range), 'w', newline='', encoding='utf-8') as output_file:
        dict_writer = csv.DictWriter(output_file,  fieldnames=keys)
        dict_writer.writeheader()
        dict_writer.writerows(rows)
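
`write_csv_reviews` keeps every row in memory and writes the file only after the whole page range has been scraped, so a failure partway through loses everything collected so far. One possible alternative is to write each row as it arrives. This is a sketch, not the approach the original script takes; `write_rows_incrementally`, `FIELDNAMES`, and the `rows` argument are hypothetical names.

```python
import csv

# Same column names as the `keys` list in write_csv_reviews
FIELDNAMES = ['ablum_score', 'album_year', 'artist', 'album_name',
              'text', 'genres', 'review_date']


def write_rows_incrementally(path, rows, fieldnames=FIELDNAMES):
    """Write dict rows to a CSV one at a time and return how many were written.

    `rows` can be any iterable, for example a generator that yields
    get_album_info(link) for each link.
    """
    count = 0
    # newline='' is what the csv module documentation recommends for writers
    with open(path, 'w', newline='', encoding='utf-8') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
            output_file.flush()  # push the row to disk before the next request
            count += 1
    return count
```

Because `rows` is consumed lazily, passing a generator means each review is fetched and written before the next one is requested.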
                
            

In [9]:
problem_url = 'https://pitchfork.com/reviews/albums/cocteau-twins-treasure-hiding-the-fontana-years/'

In [11]:
[get_album_info('https://pitchfork.com/reviews/albums/this-heat-repeatmetalmade-availablelive-80-81/')]

[{'ablum_score': '8.0',
  'album_year': ' • ',
  'artist': 'This Heat',
  'album_name': 'Repeat/Metal',
  'text': 'This Heat’s music has always felt unstable and unsettled. The work that this London post-punk trio created between 1979 and 1981, over two studio albums and a lone EP,  was a charged concoction made from equal parts dub, world music, musique concrète-inspired tape-loop experiments, and progressive rock. But perfection was never the goal: The records were rough, unpolished flashes of an ongoing, ferocious creative process, with nearly all the material starting abruptly and ending with either a hard edit or a slow fadeout. Listening to them felt like catching little snapshots when the studio door swung open for a few tantalizing minutes.This aesthetic was born of This Heat’s working methods, which found the three men (drummer Charles Hayward and multi-instrumentalists Charles Bullen and Gareth Williams) spending hours recording jam sessions in their studio, Cold Storage, a f
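
The `' • '` value for `album_year` in this output comes from `get_album_info` slicing the last four characters of the year element's text, which evidently does not always end in a year. A more tolerant parse could search the text for four-digit years instead. This is a hedged sketch that assumes the year, when present, appears as a standalone four-digit number; `extract_years` is a hypothetical helper, not part of the original script, and the sample strings are made up for illustration.

```python
import re


def extract_years(meta_text):
    """Return every four-digit year (1900-2099) found in the text, in order."""
    return re.findall(r'\b(?:19|20)\d{2}\b', meta_text)


print(extract_years(' • 1998'))  # ['1998']
print(extract_years(' • '))      # []
```

An empty list makes the missing-year case explicit, so the caller can decide whether to store an empty string or skip the row.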