# Lab 04: Scraping Reviews

**GOALS**: 

- Scrape album reviews from Pitchfork
- Scrape album images from Pitchfork


## LEVEL I

In the last example, [*intro to webscraping*](08-Beautiful-Soup-Scraping.ipynb), we extracted basic information from the page listing all reviews on **pitchfork.com**.  Your first task is to scrape the link to each review page.  This is akin to clicking on a review and being taken to the page with the full text.

![](images/pitch_ind.png)

On each review page, your goal is to scrape the headline, the text of the review, the score as a number, the author, the genre, and the date.  If you're feeling ambitious, grab the sample music files when they exist.
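
Not every review page contains every field, and `find` returns `None` when nothing matches, so calling `.text` directly can fail. The sketch below shows one way to guard against that and to turn a score string into a number. The helper names `text_or_none` and `parse_score` are hypothetical, and the HTML snippet and its class names are made up for illustration; inspect the real page to find the tags and classes Pitchfork actually uses.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name, attrs=None):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup.find(name, attrs or {})
    return tag.get_text(strip=True) if tag is not None else None


def parse_score(raw):
    """Convert a score string such as '7.5' to a float; None if it cannot be parsed."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


# Made-up HTML for illustration only (not Pitchfork's actual markup).
sample_html = """
<article>
  <h1 class="headline">Example Album</h1>
  <span class="score">7.5</span>
</article>
"""
sample = BeautifulSoup(sample_html, "html.parser")

headline = text_or_none(sample, "h1", {"class": "headline"})
score = parse_score(text_or_none(sample, "span", {"class": "score"}))
# No genre tag exists in the sample, so this is None instead of an error.
genre_text = text_or_none(sample, "li", {"class": "genre-list__item"})
```

Returning `None` for a missing field keeps every list the same length, which matters when the lists are later combined into a `DataFrame`.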

In [49]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [50]:
page_num = [i for i in range(2, 400)]
url = 'https://pitchfork.com/reviews/albums/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
reviews = soup.find_all('div', {'class': 'review'})

In [51]:
artists = []
albums = []
genre = []
author = []
when = []
links = []
for review in reviews:
    t = review.find('li').text
    artists.append(t)
    s = review.find('h2').text
    albums.append(s)
    genre.append(review.find('li',{'class': 'genre-list__item'}).text)
    author.append(review.find('ul', {'class': 'authors'}).text)
    when.append(review.find('time').text)
    links.append(review.a['href'])

In [52]:
df_reviews = pd.DataFrame({'artist': artists, 'albums': albums, 'genre': genre, 'author': author, 'when': when})

In [53]:
links = []
for i in reviews:
    links.append(i.a['href'])

In [54]:
links

['/reviews/albums/maxwell-maxwells-urban-hang-suite/',
 '/reviews/albums/pinegrove-skylight/',
 '/reviews/albums/joe-strummer-joe-strummer-001/',
 '/reviews/albums/lupe-fiasco-drogas-wave/',
 '/reviews/albums/father-awful-swim/',
 '/reviews/albums/tim-hecker-konoyo/',
 '/reviews/albums/mount-kimbie-dj-kicks/',
 '/reviews/albums/roc-marciano-behold-a-dark-horse/',
 '/reviews/albums/brandon-coleman-resistance/',
 '/reviews/albums/metric-art-of-doubt/',
 '/reviews/albums/lonnie-holley-mith/',
 '/reviews/albums/ryan-hemsworth-elsewhere/']

In [55]:
brief = []
full = []
for i in links:
    url = 'https://pitchfork.com' + i  # each link already starts with '/'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    brief.append(soup.find('div', {'class': 'review-detail__abstract'}).text)
    full.append(soup.find('div', {'class': 'contents dropcap'}).text)

In [56]:
len(full)

12

In [57]:
len(brief)

12

In [58]:
len(artists)

12

In [59]:
url = 'https://pitchfork.com/reviews/albums/' + '?page=2'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
reviews = soup.find_all('div', {'class': 'review'})
for review in reviews:
    t = review.find('li').text
    artists.append(t)
    s = review.find('h2').text
    albums.append(s)
    genre.append(review.find('li',{'class': 'genre-list__item'}).text)
    author.append(review.find('ul', {'class': 'authors'}).text)
    when.append(review.find('time').text)
    links.append(review.a['href'])

In [60]:
len(artists)

24

In [61]:
for i in links[12:]:
    url = 'https://pitchfork.com' + i  # each link already starts with '/'
    resp = requests.get(url)
    soup2 = BeautifulSoup(resp.text, 'html.parser')
    brief.append(soup2.find('div', {'class': 'review-detail__abstract'}).text)
    full.append(soup2.find('div', {'class': 'contents dropcap'}).text)

In [62]:
len(full)

24

In [63]:
len(artists)

24

In [64]:
df_reviews = pd.DataFrame({'artist': artists, 'albums': albums, 'genre': genre, 'author': author, 'when': when, 'brief': brief, 'full': full})

In [65]:
df_reviews.head()

| | albums | artist | author | brief | full | genre | when |
|---|---|---|---|---|---|---|---|
| 0 | Maxwell’s Urban Hang Suite | Maxwell | by: Jason King | Each Sunday, Pitchfork takes an in-depth look ... | In the late summer of 1996, Harlem was a loopy... | Pop/R&B | 19 hrs ago |
| 1 | Skylight | Pinegrove | by: Quinn Moreland | Completed in 2017 and shelved for almost a yea... | A year ago, it looked like Pinegrove’s next mo... | Rock | September 29 2018 |
| 2 | Joe Strummer 001 | Joe Strummer | by: Stephen Thomas Erlewine | This 32-track collection combines remastered r... | Timing never was Joe Strummer’s strong suit. N... | Rock | September 29 2018 |
| 3 | Drogas Wave | Lupe Fiasco | by: Brian Josephs | On his seventh album, the conscious hip-hop fa... | Conscious hip-hop exists in a state of perpetu... | Rap | September 29 2018 |
| 4 | Awful Swim | Father | by: Briana Younger | Having relocated to L.A., the Atlanta rapper r... | Father’s music has always been flippant. That ... | Rap | September 29 2018 |


In [69]:
url = 'https://pitchfork.com/reviews/albums/?page=' 
for i in range(1, 1714):
    response = requests.get(url + str(i))
    soup = BeautifulSoup(response.text, 'html.parser')
    reviews = soup.find_all('div', {'class': 'review'})

    for review in reviews:
        t = review.find('li').text
        artists.append(t)
        s = review.find('h2').text
        albums.append(s)
        # Store the tag itself: find() returns None when a review has no genre tag, and None has no .text
        genre.append(review.find('li',{'class': 'genre-list__item'}))
        author.append(review.find('ul', {'class': 'authors'}).text)
        when.append(review.find('time').text)
        links.append(review.a['href'])

In [70]:
len(artists)

20656

In [71]:
len(albums)

20656

In [72]:
len(genre)

20654

In [73]:
len(links)

20654

In [76]:
brief = []
full = []
for i in links:
    url = 'https://pitchfork.com' + i  # each link already starts with '/'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    brief.append(soup.find('div', {'class': 'review-detail__abstract'}))
    full.append(soup.find('div', {'class': 'contents dropcap'}))

In [77]:
len(brief)

20654

In [79]:
len(full)

20654

In [81]:
full[10].text



## LEVEL II

Go back to the original page of reviews and scroll down.  Notice that the URL in the address bar simply increments a page number (`?page=2`, `?page=3`, and so on) as you advance.  This pattern will allow you to scrape multiple pages and gather more reviews from earlier dates.

1. Build the URL for the next page of reviews by hand, and reuse your pattern above to scrape the additional reviews.
2. Write a loop to go through the next ten pages of reviews and gather the same fields from each.
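
The two steps above might be sketched as follows. This is only one possible approach: the function names `page_url` and `scrape_review_links`, the one-second pause between requests, and the choice to stop at the first non-200 response are all assumptions made for illustration. The `div` with class `review` is the same selector used in LEVEL I.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://pitchfork.com/reviews/albums/"


def page_url(page):
    """Build the listing URL for a given page number."""
    return f"{BASE_URL}?page={page}"


def scrape_review_links(pages, delay=1.0):
    """Collect the review links found on each listing page in `pages`."""
    found = []
    for page in pages:
        response = requests.get(page_url(page), timeout=30)
        if response.status_code != 200:
            # Assumption: a non-200 status means there are no more pages to read.
            break
        listing = BeautifulSoup(response.text, "html.parser")
        for review in listing.find_all("div", {"class": "review"}):
            # Skip any review block that has no link.
            if review.a is not None and review.a.has_attr("href"):
                found.append(review.a["href"])
        time.sleep(delay)  # pause between requests to go easy on the server
    return found


# Step 1: a single page built by hand.
second_page = page_url(2)

# Step 2: the next ten pages (uncomment to run; this makes live requests).
# next_ten_links = scrape_review_links(range(2, 12))
```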

## LEVEL III



Write a loop to go through all available reviews.  Save the results as a `.csv` file.  If you were able to scrape the images, store them in a folder.
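
A minimal sketch of the saving step, assuming the scraped fields are already in a `DataFrame` and each image is available as raw bytes (for example, the `.content` of a `requests` response). The function names `save_reviews` and `save_image`, the default folder, and the file names are hypothetical. Passing `index=False` keeps pandas from writing the row index as an extra unnamed column.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_reviews(df, out_dir="output", filename="reviews.csv"):
    """Write the DataFrame to out_dir/filename and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    csv_path = out_path / filename
    df.to_csv(csv_path, index=False)
    return csv_path


def save_image(content, out_dir, filename):
    """Write raw image bytes to out_dir/filename and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    image_path = out_path / filename
    image_path.write_bytes(content)
    return image_path


# Demonstration with placeholder data in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"artist": ["Example Artist"], "albums": ["Example Album"]})
    csv_path = save_reviews(demo, tmp)
    round_trip = pd.read_csv(csv_path)
    # Placeholder bytes stand in for a downloaded image.
    image_path = save_image(b"\x89PNG", Path(tmp) / "images", "example.png")
    image_size = image_path.stat().st_size
```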

## LEVEL IV

It is easy to use the `textblob` library to add the polarity and subjectivity of reviews to our `DataFrame`.  We need to convert the text to a `TextBlob` object, and then use its `.polarity` and `.subjectivity` attributes as new columns in our `DataFrame`.  Use the example below as a starting place to add two new columns to your dataframe containing the polarity and subjectivity scores for each review.

In [1]:
rev = "Danielle Bregoli’s leap from meme to rapper continues with her debut mixtape that leans heavily on mimicry and trails dreadfully behind the current sound of hip-hop."

In [2]:
from textblob import TextBlob

In [3]:
text = TextBlob(rev)

In [5]:
text.sentiment

Sentiment(polarity=-0.05000000000000002, subjectivity=0.5)

In [6]:
text.polarity

-0.05000000000000002

In [8]:
text.subjectivity

0.5
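
One way to extend this to a whole column is sketched below. The helper names `textblob_scores` and `add_sentiment_columns` are hypothetical, and the usage line assumes `df_reviews` has a `brief` column holding plain strings. The scorer is passed in as a parameter so that a different scoring function can be swapped in.

```python
import pandas as pd


def textblob_scores(text):
    """Return (polarity, subjectivity) for a string using TextBlob."""
    from textblob import TextBlob  # imported here so other scorers work without it

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def add_sentiment_columns(df, text_col, scorer=textblob_scores):
    """Return a copy of df with 'polarity' and 'subjectivity' columns added."""
    # Treat missing text as an empty string so the scorer always gets a str.
    scores = df[text_col].fillna("").astype(str).map(scorer)
    result = df.copy()
    result["polarity"] = [polarity for polarity, _ in scores]
    result["subjectivity"] = [subjectivity for _, subjectivity in scores]
    return result


# Usage with the notebook's DataFrame (assumes 'brief' holds strings):
# df_reviews = add_sentiment_columns(df_reviews, "brief")
```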