### Webscraping

**OBJECTIVES**

- Use `BeautifulSoup` to parse HTML 
- Scrape websites and structure their data in a `DataFrame`
- Build models using text as input
- Use `CountVectorizer` to create a numeric representation of text
- Use the `selenium` library to interact with web pages

In [1]:
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

### HTML

Below is some basic HTML.  HTML content is always housed in tags, and we will use these tags to locate the elements of a webpage that we want to extract.

In [2]:
some_html = """
<h1>Hello</h1>
<p>This is a paragraph</p>
<p class = "second">This is another paragraph</p>
<div>
<a href = 'www.google.com'><p>Your Friend</p></a>
"""

#### `BeautifulSoup`

To locate elements in HTML, we use the `BeautifulSoup` library.

In [5]:
#make the soup
soup = BeautifulSoup(some_html)
type(soup)
soup

<html><body><h1>Hello</h1>
<p>This is a paragraph</p>
<p class="second">This is another paragraph</p>
<div>
<a href="www.google.com"><p>Your Friend</p></a>
</div></body></html>

In [6]:
#find first h1 tag
soup.find('h1')

<h1>Hello</h1>

In [7]:
# find first p tag
soup.find('p')

<p>This is a paragraph</p>

In [8]:
# find all paragraphs
soup.find_all('p')

[<p>This is a paragraph</p>,
 <p class="second">This is another paragraph</p>,
 <p>Your Friend</p>]

In [11]:
#extract text from p tags
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)

This is a paragraph
This is another paragraph
Your Friend


In [12]:
# extract based on CSS class
soup.find('p', {'class': 'second'})

<p class="second">This is another paragraph</p>
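`find` and `find_all` also have CSS-selector counterparts, `select_one` and `select`. The sketch below runs them on a tidied copy of the toy HTML (the name `toy_html` is introduced here only so the block runs on its own) and also pulls the `href` attribute out of the link.

```python
from bs4 import BeautifulSoup

# tidied copy of the toy HTML from above, repeated so this block is self-contained
toy_html = """
<h1>Hello</h1>
<p>This is a paragraph</p>
<p class="second">This is another paragraph</p>
<div>
<a href="www.google.com"><p>Your Friend</p></a>
</div>
"""
toy_soup = BeautifulSoup(toy_html, 'html.parser')

# CSS selector: the first p tag with class "second"
second = toy_soup.select_one('p.second')
print(second.text)

# CSS selector: every p tag nested inside an a tag
nested = [p.text for p in toy_soup.select('a p')]
print(nested)

# tag attributes behave like a dictionary; .get returns None if the attribute is missing
link = toy_soup.find('a')
print(link.get('href'))
```

Selectors are handy when the pattern copied from the browser's inspect tool is already written as CSS.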

### Getting HTML

Usually, we want to extract information from a live webpage.  To get the HTML, we use `requests` and turn the text of the response into a `BeautifulSoup` object.

In [13]:
url = 'https://pitchfork.com/reviews/albums/'

In [15]:
#make a request
r = requests.get(url)
r

<Response [200]>

In [16]:
#turn it to soup
soup = BeautifulSoup(r.text, 'html.parser')
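A request does not always come back with a `200`, so it can help to fail loudly before parsing. Below is a minimal sketch of a wrapper; the helper name `get_soup`, the `User-Agent` string, and the 10 second timeout are all assumptions made for illustration, and the `get` argument exists only so the function can be exercised without touching the network.

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url, get=requests.get, timeout=10):
    """Request url and return the parsed soup, raising on HTTP errors (hypothetical helper)."""
    # the User-Agent value is an illustrative assumption, not something the site requires
    headers = {'User-Agent': 'webscraping-lesson/0.1'}
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, 'html.parser')
```

With a helper like this, `soup = get_soup(url)` would stand in for the request and parsing cells above.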

In [18]:
# soup

#### Finding elements

Typically, this is a bit of a dance with the inspect tool in your browser.  Let's try to find each individual review as a start.

In [19]:
# first review
soup.find('div', {'class': 'review'})

<div class="review"><a class="review__link" href="/reviews/albums/phoenix-alpha-zulu/"><div class="review__artwork artwork"><div class=""><img alt="Phoenix: Alpha Zulu" src="https://media.pitchfork.com/photos/6318afadafa92f85cc3e46c6/1:1/w_160/phoenix-alpha-zulu.jpg"/></div></div><div class="review__title"><ul class="artist-list review__title-artist"><li>Phoenix</li></ul><h2 class="review__title-album"><em>Alpha Zulu</em></h2></div></a><div class="review__meta"><ul class="genre-list genre-list--inline review__genre-list"><li class="genre-list__item"><a class="genre-list__link" href="/reviews/albums/?genre=rock">Rock</a></li></ul><ul class="authors"><li><a class="linked display-name display-name--linked" href="/staff/brady-brickner-wood/"><span class="by">by: </span>Brady Brickner-Wood</a></li></ul><time class="pub-date" datetime="2022-11-10T05:03:00" title="Thu, 10 Nov 2022 05:03:00 GMT">10 hrs ago</time></div></div>

In [21]:
# first image
soup.find('img').attrs

{'src': 'https://media.pitchfork.com/photos/6318afadafa92f85cc3e46c6/1:1/w_160/phoenix-alpha-zulu.jpg',
 'alt': 'Phoenix: Alpha Zulu'}

In [22]:
# extract url
soup.find('img').attrs['src']

'https://media.pitchfork.com/photos/6318afadafa92f85cc3e46c6/1:1/w_160/phoenix-alpha-zulu.jpg'

In [23]:
from IPython.display import Image

In [24]:
# visualize album cover
Image(url = soup.find('img').attrs['src'])

#### Extracting data from reviews

- Album
- Artist
- Genre
- Reviewer
- When
- Cover Art
- Full review url

In [45]:
reviews = soup.find_all('div', {'class': 'review'})
reviews[-1].find('a', {'class': 'display-name'}).text #reviewer
reviews[-1].find('time').attrs['datetime'] #when
reviews[-1].find('a', {'class': 'review__link'}).attrs['href'] #full review link
reviews[-1].find('img').attrs['src'] #image art

'https://media.pitchfork.com/photos/6369229dfe4a5479c798061b/1:1/w_160/Duke%20Deuce%20-%20Memphis%20Massacre%20III.jpeg'

In [48]:
#lists to hold our data
album_names = []
artists = []
genres = []
reviewers = []
whens = []
links = []
covers = []
for review in reviews:
    artists.append(review.find('li').text) # get the artist
    album_names.append(review.find('em').text) # album name
    genres.append(review.find('a', {'class': 'genre-list__link'}).text) # genre
    reviewers.append(review.find('a', {'class': 'display-name'}).text) # reviewer
    whens.append(review.find('time').attrs['datetime'] ) # when
    links.append('https://pitchfork.com' + review.find('a', {'class': 'review__link'}).attrs['href']) # link to full review
    covers.append(review.find('img').attrs['src']) # album cover image

In [49]:
links

['https://pitchfork.com/reviews/albums/phoenix-alpha-zulu/',
 'https://pitchfork.com/reviews/albums/bandmanrill-club-godfather/',
 'https://pitchfork.com/reviews/albums/bluebucksclan-clan-way-3/',
 'https://pitchfork.com/reviews/albums/aoife-nessa-frances-protector/',
 'https://pitchfork.com/reviews/albums/dawn-richard-spencer-zahn-pigments/',
 'https://pitchfork.com/reviews/albums/tenci-a-swollen-river-a-well-overflowing/',
 'https://pitchfork.com/reviews/albums/knifeplay-animal-drowning/',
 'https://pitchfork.com/reviews/albums/sobs-air-guitar/',
 'https://pitchfork.com/reviews/albums/drake-21-savage-her-loss/',
 'https://pitchfork.com/reviews/albums/hawa-hadja-bangoura/',
 'https://pitchfork.com/reviews/albums/okay-kaya-sap/',
 'https://pitchfork.com/reviews/albums/duke-deuce-memphis-massacre-iii/']

#### Using the links to extract the data

In [50]:
full_review_urls = links[0]

In [58]:
review_texts = []
scores = []
for link in links:
    r = requests.get(link)
    review_soup = BeautifulSoup(r.text, 'html.parser') # follow link to full review
    review_texts.append(review_soup.find('div', {'class': 'body__inner-container'}).text) #extract full review text
    scores.append(review_soup.find('div', {'class': 'ScoreCircle-cIILhI'}).text) #extract the score

In [59]:
scores

['7.1',
 '7.3',
 '7.3',
 '7.5',
 '8.3',
 '7.5',
 '7.3',
 '7.5',
 '6.4',
 '7.4',
 '7.5',
 '7.7']
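The class name `ScoreCircle-cIILhI` looks machine-generated, so it may well change when the site is rebuilt. One hedge, sketched below on a made-up snippet rather than a real review page, is to match only the stable-looking prefix with a regular expression and to convert the score to a float straight away.

```python
import re
from bs4 import BeautifulSoup

# stand-in for a review page; the suffix after "ScoreCircle-" is treated as unpredictable
page = BeautifulSoup(
    '<div class="ScoreCircle-abc123 other"><p>7.1</p></div>', 'html.parser'
)

# match any div that has a class starting with "ScoreCircle"
score_tag = page.find('div', class_=re.compile(r'^ScoreCircle'))

# fall back to NaN when the tag is missing instead of raising
score = float(score_tag.text) if score_tag is not None else float('nan')
print(score)
```

When looping over many links, a short `time.sleep` between requests is also a common courtesy to the site.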

In [77]:
album_names = []
artists = []
genres = []
reviewers = []
whens = []
links = []
covers = []
review_texts = []
scores = []
for i in range(1, 6):
    url = f'https://pitchfork.com/reviews/albums/?page={i}'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    reviews = soup.find_all('div', {'class': 'review'})

    for review in reviews:
        # .find returns None when a tag is missing, so .text raises AttributeError
        try:
            artists.append(review.find('li').text)
        except AttributeError:
            artists.append('unknown')
        try:
            album_names.append(review.find('em').text)
        except AttributeError:
            album_names.append('unknown')
        try:
            genres.append(review.find('a', {'class': 'genre-list__link'}).text)
        except AttributeError:
            genres.append('unknown')
        try:
            reviewers.append(review.find('a', {'class': 'display-name'}).text)
        except AttributeError:
            reviewers.append('unknown')
        whens.append(review.find('time').attrs['datetime'] )
        links.append('https://pitchfork.com' + review.find('a', {'class': 'review__link'}).attrs['href'])
        covers.append(review.find('img').attrs['src'])
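The repeated `try`/`except` blocks could be folded into one small helper. The sketch below is one possible shape, not the notebook's own method: `text_or_default` is a hypothetical name, and the HTML is a trimmed stand-in modeled on the first review shown earlier. It also builds one dictionary per review, which keeps the columns aligned without juggling parallel lists.

```python
from bs4 import BeautifulSoup
import pandas as pd

def text_or_default(parent, name, attrs=None, default='unknown'):
    """Return the text of the first matching tag, or default if no tag matches."""
    tag = parent.find(name, attrs or {})
    return tag.text if tag is not None else default

# trimmed stand-in for one review card (no reviewer link, to show the fallback)
sample = BeautifulSoup("""
<div class="review">
  <a class="review__link" href="/reviews/albums/phoenix-alpha-zulu/">
    <ul class="artist-list"><li>Phoenix</li></ul>
    <h2><em>Alpha Zulu</em></h2>
  </a>
  <ul><li class="genre-list__item"><a class="genre-list__link" href="/reviews/albums/?genre=rock">Rock</a></li></ul>
</div>
""", 'html.parser')

rows = []
for review in sample.find_all('div', {'class': 'review'}):
    rows.append({
        'artist': text_or_default(review, 'li'),
        'album': text_or_default(review, 'em'),
        'genre': text_or_default(review, 'a', {'class': 'genre-list__link'}),
        'reviewer': text_or_default(review, 'a', {'class': 'display-name'}),
    })

# a list of dictionaries converts directly into a DataFrame
sample_df = pd.DataFrame(rows)
print(sample_df)
```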

In [109]:
### THIS CODE WILL LOOP OVER THE LINKS AND EXTRACT TEXT AND SCORE FROM EACH
# for link in links:
#     r = requests.get(link)
#     review_soup = BeautifulSoup(r.text, 'html.parser')
#     try:
#         review_texts.append(review_soup.find('div', {'class': 'body__inner-container'}).text)
#     except:
#         review_texts.append('unknown')
#     try:
#         scores.append(review_soup.find('div', {'class': 'ScoreCircle-cIILhI'}).text)
#     except:
#         scores.append(np.nan)
#     print(link)

In [103]:
review_df = pd.DataFrame({'review': review_texts, 'scores': scores})
review_df.head()

| | review | scores |
|---|---|---|
| 0 | How does a band as definitively springy as Pho... | 7.1 |
| 1 | Any song backed by the iconic triple kick drum... | 7.3 |
| 2 | For BlueBucksClan, popping bottles with models... | 7.3 |
| 3 | Throughout “Day Out of Time,” the closing song... | 7.5 |
| 4 | Dawn Richard’s music feels as if it’s emanatin... | 8.3 |


In [104]:
review_df.shape

(5727, 2)

In [105]:
review_df['artist'] = artists[:len(review_df)]

In [106]:
review_df['album'] = album_names[:len(review_df)]
review_df['genre'] = genres[:len(review_df)]
review_df['date'] = whens[:len(review_df)]

In [107]:
review_df.to_csv('reviews.csv', index=False)

In [108]:
review_df.tail()

| | review | scores | artist | album | genre | date |
|---|---|---|---|---|---|---|
| 5722 | Every member of the international avant-garde ... | 7.9 | Nazoranai | Beginning to Fall in Line Before Me, So Decoro... | Experimental | 2017-10-21T05:00:00 |
| 5723 | Dan Bejar has been recording as Destroyer for ... | 7.9 | Destroyer | ken | Rock | 2017-10-20T05:00:00 |
| 5724 | The events that inspired Reaching for Indigo—H... | 8.2 | Circuit des Yeux | Reaching for Indigo | Rock | 2017-10-20T05:00:00 |
| 5725 | unknown | NaN | The Jam | unknown | Rock | 2017-10-20T05:00:00 |
| 5726 | Like many new rappers, what makes G Herbo inte... | 7.8 | G Herbo | Humble Beast | Rap | 2017-10-20T05:00:00 |


### Getting more data

Looking at the URL, the `?page=` query parameter suggests a way to extract the first 10 pages of review data.

### Getting full review and score

Now we have URLs for our full reviews.  Let's use these to extract the score and full text of each review.

### Good vs. Bad

What score would you say makes an album good vs. a score that should be considered bad?
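Whatever cutoff you settle on, turning it into a binary target takes two steps: convert the scraped score strings to numbers, then compare against the threshold. The sketch below uses a handful of made-up values, and the cutoff of `7.5` is an assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# a few scores in the string form the scraper returns, plus one missing value
toy_scores = pd.Series(['7.1', '8.3', '6.4', np.nan])

# strings -> floats; anything unparseable becomes NaN
numeric = pd.to_numeric(toy_scores, errors='coerce')

threshold = 7.5  # assumed cutoff, not a recommendation

# 1 = "good", 0 = "bad"; rows with a missing score are dropped first
labels = (numeric.dropna() >= threshold).astype(int)
print(labels.tolist())
```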

### Text Representation

To use the review text in a model, we need to turn it into a numeric representation.  A basic approach is to use the count of each individual word as a feature.  Here, we can use `CountVectorizer` to transform the data.

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>the</th>      <th>dog</th>      <th>ate</th>      <th>salami</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0</td>      <td>1</td>      <td>1</td>      <td>1</td>    </tr>    <tr>      <th>1</th>      <td>1</td>      <td>1</td>      <td>0</td>      <td>1</td>    </tr>  </tbody></table>


In [68]:
# instantiate count vectorizer
cvect = CountVectorizer()

In [70]:
# fit and transform the data
dtm = cvect.fit_transform(review_df['review'])

In [71]:
# what kind of object is this?
dtm

<60x7619 sparse matrix of type '<class 'numpy.int64'>'
	with 18986 stored elements in Compressed Sparse Row format>

In [None]:
# convert to dense
dtm.toarray()

In [None]:
# which words became features?
cvect.get_feature_names_out()

In [None]:
# dataframe
dtm_df = pd.DataFrame(dtm.toarray(), columns = cvect.get_feature_names_out())
dtm_df.head()

In [None]:
# pipeline


In [None]:
# model


In [None]:
# score it


In [None]:
# coefficients


### `selenium`

Selenium allows you to interact with the webpage directly using a webdriver.  

https://selenium-python.readthedocs.io/installation.html#drivers

In [None]:
# !pip install -U selenium

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [None]:
driver = webdriver.Chrome()
driver.get("http://www.python.org")

In [None]:
#find element
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
driver.close()

#### Important Examples

In [None]:
url = 'https://www.slapmagazine.com/'

In [None]:
driver = webdriver.Chrome()
driver.get(url)

'''<input class="search_input" type="text" name="search" value="Search..." 
onfocus="this.value = '';" onblur="if(this.value=='') this.value='Search...';">'''

In [None]:
element = driver.find_element(By.XPATH, "//input[@class='search_input']")

In [None]:
element.send_keys("Drake")

In [None]:
element.submit()  # submit() returns None, so there is nothing to store

In [None]:
elements = driver.find_elements(By.XPATH, "//div[@class='search_results_posts']")

In [None]:
next_page = driver.find_element(By.XPATH, "//a[@class='navPages']")

In [None]:
next_page.click()

In [None]:
next_page = driver.find_elements(By.XPATH, "//a[@class='navPages']")

In [None]:
elements = driver.find_elements(By.XPATH, "//div[@class='search_results_posts']")

In [None]:
pages = driver.find_elements(By.XPATH, "//a[@class='navPages']")

In [None]:
pages[1].click()

In [None]:
pages = driver.find_elements(By.XPATH, "//a[@class='navPages']")

In [None]:
pages[2].click()

In [None]:
url = 'https://www.slapmagazine.com/'
driver = webdriver.Chrome()
driver.get(url)
element = driver.find_element(By.XPATH, "//input[@class='search_input']")
element.send_keys("Drake")
element.submit()


titles = []
for i in range(5):
    elements = driver.find_elements(By.XPATH, "//div[@class='search_results_posts']")
    for elem in elements:
        titles.append(elem.text)
    pages = driver.find_elements(By.XPATH, "//a[@class='navPages']")
    pages[i].click()

In [None]:
driver.close()