### Webscraping

**OBJECTIVES**

- Use `BeautifulSoup` to parse HTML 
- Scrape websites and structure their data in a `DataFrame`
- Build models using text as input
- Use `CountVectorizer` to create a numeric representation of text
- Use the `selenium` library to interact with web pages

In [None]:
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

### HTML

Below is some basic HTML.  Content is always wrapped in tags, and we will use these tags to locate the elements of a webpage that we want to extract.

In [None]:
some_html = """
<h1>Hello</h1>
<p>This is a paragraph</p>
<p class="second">This is another paragraph</p>
<div>
<a href='www.google.com'><p>Your Friend</p></a>
</div>
"""

#### `BeautifulSoup`

To locate elements in HTML, we use the `BeautifulSoup` library.
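As a minimal sketch of the lookups practiced in the cells that follow, the block below parses a small HTML string and pulls out tags, text, and attributes. The `html` string is a stand-in defined only for this example, and `"html.parser"` is Python's built-in parser; other parsers may behave slightly differently.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, similar in shape to the example HTML in this notebook.
html = """
<h1>Hello</h1>
<p>This is a paragraph</p>
<p class="second">This is another paragraph</p>
<div>
  <a href="https://www.google.com"><p>Your Friend</p></a>
</div>
"""

# Make the soup with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.find("h1")          # first <h1> tag
first_p = soup.find("p")            # first <p> tag
all_p = soup.find_all("p")          # list of every <p> tag, in document order
p_text = [p.text for p in all_p]    # text inside each <p>
second = soup.select("p.second")    # CSS selector: <p> tags with class "second"
link = soup.find("a")["href"]       # attributes are read like dictionary keys
```

`find` returns a single tag (or `None` when nothing matches), while `find_all` and `select` always return lists.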

In [None]:
#make the soup


In [None]:
#find first h1 tag


In [None]:
# find first p tag


In [None]:
# find all paragraphs


In [None]:
#extract text from p tags


In [None]:
#extract based on css 


### Getting HTML

Usually, we want to extract information from a live webpage.  To get its HTML, we make a request with `requests` and pass the text of the response to `BeautifulSoup`.
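One way to package the request-then-parse step is sketched below. The helper names `html_to_soup` and `fetch_soup` and the 10 second timeout are choices made for this example, not part of either library.

```python
import requests
from bs4 import BeautifulSoup


def html_to_soup(html_text):
    """Parse a string of HTML into a BeautifulSoup object."""
    return BeautifulSoup(html_text, "html.parser")


def fetch_soup(url, timeout=10):
    """Request a page and return its parsed HTML.

    Raises requests.HTTPError if the server answers with an error status.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return html_to_soup(response.text)
```

Calling `fetch_soup` on a review page URL returns a soup that can be searched with `find` and `find_all`. Some sites block automated requests or limit their rate, so a failed request does not necessarily mean the code is wrong.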

In [None]:
url = 'https://pitchfork.com/reviews/albums/'

In [None]:
#make a request


In [None]:
#turn it to soup


#### Finding elements

Typically, this is a bit of a dance with the inspect tool in your browser.  Let's try to find each individual review as a start.
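The pattern usually looks something like the sketch below: find a container, then search inside it. The markup and the class name `review` are placeholders invented for this example; the real names have to be read off the page with the inspect tool.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are placeholders, not the site's real ones.
page = """
<div class="review">
  <a href="/reviews/albums/example-album/">
    <img src="https://example.com/cover-1.jpg" alt="Example Album">
  </a>
</div>
<div class="review">
  <a href="/reviews/albums/another-album/">
    <img src="https://example.com/cover-2.jpg" alt="Another Album">
  </a>
</div>
"""
soup = BeautifulSoup(page, "html.parser")

first_review = soup.find("div", class_="review")  # first review container
first_image = first_review.find("img")            # first image inside that container
image_url = first_image["src"]                    # value of the src attribute
```

In a notebook, `Image(url=image_url)` from `IPython.display` would then render the cover.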

In [None]:
# first review


In [None]:
# first image


In [None]:
# extract url


In [None]:
from IPython.display import Image

In [None]:
# visualize album cover


#### Extracting data from reviews

- Album
- Artist
- Genre
- Reviewer
- When
- Cover Art
- Full review url
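A rough sketch of collecting fields like these into a `DataFrame` follows. The markup, class names, and values are all made up for illustration, and only four of the fields are shown; the loop builds one dictionary per review and hands the list to `pandas`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; real class names must come from the browser's inspect tool.
page = """
<div class="review">
  <h2 class="album">Example Album</h2>
  <span class="artist">Example Artist</span>
  <span class="genre">Rock</span>
  <a class="link" href="/reviews/albums/example-album/">Read</a>
</div>
<div class="review">
  <h2 class="album">Another Album</h2>
  <span class="artist">Another Artist</span>
  <span class="genre">Jazz</span>
  <a class="link" href="/reviews/albums/another-album/">Read</a>
</div>
"""
soup = BeautifulSoup(page, "html.parser")

rows = []
for review in soup.find_all("div", class_="review"):
    # One dictionary per review; the keys become DataFrame columns.
    rows.append({
        "album": review.find("h2", class_="album").text,
        "artist": review.find("span", class_="artist").text,
        "genre": review.find("span", class_="genre").text,
        "review_url": review.find("a", class_="link")["href"],
    })

reviews = pd.DataFrame(rows)
```

On a real page some reviews may be missing a field, in which case `find` returns `None` and the `.text` access needs a guard.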

### Getting more data

Looking at the URL, there may be a pattern we can use to request the first 10 pages of review data.
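If the site numbers its pages with a query parameter, the URLs can be generated rather than typed out. The `?page=` pattern below is an assumption; confirm it by clicking to the second page and reading the address bar.

```python
base_url = "https://pitchfork.com/reviews/albums/"

# Assumption: pages are addressed as ?page=1, ?page=2, and so on.
page_urls = [f"{base_url}?page={n}" for n in range(1, 11)]
```

Each URL in `page_urls` can then be requested in a loop, ideally with a short pause between requests to avoid hammering the server.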

### Getting full review and score

Now we have URLs for the full reviews.  Let's use these to extract the score and full text of each review.
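A sketch of that extraction, written against a made-up review page, is below. The function name `parse_review` and the `score` and `body` class names are hypothetical; on the real site they would be replaced with whatever the inspect tool shows.

```python
from bs4 import BeautifulSoup

# Hypothetical review page; tag and class names are placeholders.
review_html = """
<span class="score">8.2</span>
<div class="body">
  <p>The opening track sets the tone.</p>
  <p>The rest of the record follows through.</p>
</div>
"""


def parse_review(html_text):
    """Return (score, text) from one review page's HTML."""
    soup = BeautifulSoup(html_text, "html.parser")
    score = float(soup.find("span", class_="score").text)
    paragraphs = soup.find("div", class_="body").find_all("p")
    # Join the paragraphs into one string of review text.
    text = " ".join(p.text.strip() for p in paragraphs)
    return score, text


score, text = parse_review(review_html)
```

Converting the score to `float` at this stage makes it easy to compare against a threshold later.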

### Good vs. Bad

What score would you say makes an album good vs. a score that should be considered bad?
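Whatever cutoff you settle on, turning it into a binary target is one line of `pandas`. The scores and the threshold of 7.0 below are invented for illustration, not a recommendation.

```python
import pandas as pd

# Made-up scores for illustration only.
reviews = pd.DataFrame({"score": [8.4, 6.1, 7.0, 3.5]})

threshold = 7.0  # assumed cutoff; pick your own after looking at the score distribution
reviews["good"] = (reviews["score"] >= threshold).astype(int)  # 1 = good, 0 = bad
```

A histogram of the real scores is a reasonable way to check whether the chosen cutoff leaves the two classes badly imbalanced.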

### Text Representation

To use the review text in a model, we need to turn it into a numeric representation.  A basic approach is to use the count of each individual word as a feature.  Here, we can use `CountVectorizer` to transform the data.

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>the</th>      <th>dog</th>      <th>ate</th>      <th>salami</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>0</td>      <td>1</td>      <td>1</td>      <td>1</td>    </tr>    <tr>      <th>1</th>      <td>1</td>      <td>1</td>      <td>0</td>      <td>1</td>    </tr>  </tbody></table>


In [None]:
# instantiate count vectorizer


In [None]:
# fit and transform the data


In [None]:
# what kind of object did we get back?


In [None]:
# convert to dense


In [None]:
# which words became features?


In [None]:
# dataframe


In [None]:
# pipeline


In [None]:
# model


In [None]:
# score it


In [None]:
# coefficients


### `selenium`

Selenium allows you to interact with a webpage directly using a webdriver, a program that controls a real browser.  Driver installation instructions are here:

https://selenium-python.readthedocs.io/installation.html#drivers

In [None]:
# !pip install -U selenium

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [None]:
driver = webdriver.Chrome()
driver.get("http://www.python.org")

In [None]:
#find element
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
driver.close()

#### Important Examples

In [None]:
url = 'https://www.slapmagazine.com/'

In [None]:
driver = webdriver.Chrome()
driver.get(url)

# The search box on this page is marked up as:
# <input class="search_input" type="text" name="search" value="Search..."
# onfocus="this.value = '';" onblur="if(this.value=='') this.value='Search...';">

In [None]:
element = driver.find_element(By.XPATH, "//input[@class='search_input']")

In [None]:
element.send_keys("Drake")

In [None]:
# submit() returns None, so there is nothing to assign; the driver loads the results page
element.submit()

In [None]:
elements = driver.find_elements(By.XPATH, "//div[@class='search_results_posts']")

In [None]:
next_page = driver.find_element(By.XPATH, "//a[@class='navPages']")

In [None]:
next_page.click()

In [None]:
next_page = driver.find_elements(By.XPATH, "//a[@class='navPages']")

In [None]:
elements = driver.find_elements(By.XPATH, "//div[@class='search_results_posts']")

In [None]:
pages = driver.find_elements(By.XPATH, "//a[@class='navPages']")

In [None]:
pages[1].click()

In [None]:
pages = driver.find_elements(By.XPATH, "//a[@class='navPages']")

In [None]:
pages[2].click()

In [None]:
url = 'https://www.slapmagazine.com/'
driver = webdriver.Chrome()
driver.get(url)
element = driver.find_element(By.XPATH, "//input[@class='search_input']")
element.send_keys("Drake")
element.submit()


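titles = []
for i in range(5):
    # re-locate elements on every page; references from the previous page go stale
    elements = driver.find_elements(By.XPATH, "//div[@class='search_results_posts']")
    for elem in elements:
        titles.append(elem.text)
    pages = driver.find_elements(By.XPATH, "//a[@class='navPages']")
    if i >= len(pages):
        # no further page link to follow
        break
    pages[i].click()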

In [None]:
driver.close()