# Data Science - Web Scraping

## Tasks Today:

1) <b>Requests</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Requests <br>
2) <b>Beautiful Soup</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) .prettify() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Converting to a List <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Extracting Beautiful Soup Elements <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Assigning Variables from Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) .find() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) .find_all() <br>
3) <b>Exercise</b> <br>

## Requests

In [1]:
# Install Beautiful Soup (HTML parsing) and httpx (HTTP requests)
!pip install beautifulsoup4
!pip install httpx



### Importing

In [2]:
import httpx

### Using Requests

In [3]:
# Send a GET request to the URL
page = httpx.get('https://www.arthurleej.com/e-love.html')

In [4]:
# Display the response object (shows the HTTP status)
page

<Response [200 OK]>

##### .content

In [5]:
# .content is an attribute (not a method) holding the raw response body as bytes
page.content

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">\r<html>\r<head>\r\t<title>Essay on Love by Arthur Lee Jacobson</title>\r<meta name="description" content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson.">\r<meta name="keywords" content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington">\r<meta name="resource-type" content="document">\r<meta name="generator" content="BBEdit 4.5">\r<meta name="robots" content="all">\r<meta name="classification" content="Gardening">\r<meta name="distribution" content="global">\r<meta name="rating" content="general">\r<meta name="copyright" content="2001 Arthur Lee Jacobson">\r<meta name="author" content="eriktyme@eriktyme.com">\r<meta name="language" content="en-us">\r</head>\r<body background="images/background1a.jpg" bgcolor="#FFFFCC" text="#000000" link="#00

## Beautiful Soup

### Importing

In [6]:
from bs4 import BeautifulSoup

### Using Beautiful Soup

In [7]:
# Instantiate BeautifulSoup class
soup = BeautifulSoup(page.content, 'html.parser')

soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
 <html> <head> <title>Essay on Love by Arthur Lee Jacobson</title> <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/> <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/> <meta content="document" name="resource-type"/> <meta content="BBEdit 4.5" name="generator"/> <meta content="all" name="robots"/> <meta content="Gardening" name="classification"/> <meta content="global" name="distribution"/> <meta content="general" name="rating"/> <meta content="2001 Arthur Lee Jacobson" name="copyright"/> <meta content="eriktyme@eriktyme.com" name="author"/> <meta content="en-us" name="language"/> </head> <body alink="#33CC33" background="images/background1a.jpg" bgcolor="#FFFFCC" link="#0000FF" t
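
Once the document is parsed, tags can be reached as attributes of the soup object. The following is a small sketch on made-up HTML (the `sample_html` string and variable names are assumptions, not taken from the page above):

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
sample_html = b"<html><head><title>Sample Page</title></head><body><p>First</p></body></html>"

# BeautifulSoup accepts bytes (such as page.content) or str.
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title               # first <title> tag in the document
title_text = sample_soup.title.string       # the text inside that tag
first_paragraph = sample_soup.p.get_text()  # text of the first <p> tag
```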

### .prettify()

In [8]:
# NOTE: .prettify() works on the full document and on any single Tag (e.g. the result of .find()),
# but not on the list-like ResultSet returned by .find_all()
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>
   Essay on Love by Arthur Lee Jacobson
  </title>
  <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/>
  <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/>
  <meta content="document" name="resource-type"/>
  <meta content="BBEdit 4.5" name="generator"/>
  <meta content="all" name="robots"/>
  <meta content="Gardening" name="classification"/>
  <meta content="global" name="distribution"/>
  <meta content="general" name="rating"/>
  <meta content="2001 Arthur Lee Jacobson" name="copyright"/>
  <meta content="eriktyme@eriktyme.com" name="author"/>
  <meta content="en-us" name="language"/>
 </head>
 <body alink="#33CC33" background="images/background1a.jpg" b
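
As a rough illustration of where `.prettify()` is available, the sketch below uses made-up HTML: a single Tag returned by `.find()` can be prettified, while the collection returned by `.find_all()` has to be handled one Tag at a time.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
sample_soup = BeautifulSoup("<div><p>One</p><p>Two</p></div>", "html.parser")

single_tag = sample_soup.find("p")
pretty_single = single_tag.prettify()          # works: find() returns a single Tag

all_tags = sample_soup.find_all("p")
has_prettify = hasattr(all_tags, "prettify")   # the ResultSet itself has no prettify()

# Prettify each Tag in the result individually instead.
pretty_each = [tag.prettify() for tag in all_tags]
```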

### Converting to a List

In [9]:
# Tags may contain strings and other tags. These elements are the tag’s children.
list(soup.children)

# print(len(list(soup.children)))

['HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"',
 ' ',
 <html> <head> <title>Essay on Love by Arthur Lee Jacobson</title> <meta content="Trees,gardening, wild and domestic plant life are the specialty of author Arthur Lee Jacobson." name="description"/> <meta content="trees, gardening, wild plants, domestic plants, gardening author, gardening books, Arthur Lee Jacobson, A L J, A L Jacobson, Arthur Jacobson, arthur lee, plants, flowers, seattle, washington" name="keywords"/> <meta content="document" name="resource-type"/> <meta content="BBEdit 4.5" name="generator"/> <meta content="all" name="robots"/> <meta content="Gardening" name="classification"/> <meta content="global" name="distribution"/> <meta content="general" name="rating"/> <meta content="2001 Arthur Lee Jacobson" name="copyright"/> <meta content="eriktyme@eriktyme.com" name="author"/> <meta content="en-us" name="language"/> </head> <body alink="#33CC33" background="images/background1a.jpg" bgcolor="#FFFFCC" link="#0000FF" te

### Extracting Beautiful Soup Elements

In [10]:
# We can traverse an HTML page and extract other tags and text.
# The example below shows the type of each top-level child of the parsed document.
# Tag objects let us go deeper: we can read HTML attributes such as class and keep navigating into their children.
[type(item) for item in list(soup.children)]

[bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]
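
Because the children mix doctype declarations, whitespace strings, and tags, it is often convenient to keep only the `Tag` objects. A minimal sketch on made-up HTML (the sample string and variable names are assumptions):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML used only for illustration.
sample_html = "<!DOCTYPE html>\n<html><head><title>T</title></head><body><p>x</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Names of the types found among the top-level children.
child_types = [type(child).__name__ for child in sample_soup.children]

# Keep only real tags, dropping the doctype and whitespace strings.
tag_children = [child for child in sample_soup.children if isinstance(child, Tag)]
tag_names = [tag.name for tag in tag_children]
```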

### Assigning Variables from Beautiful Soup

In [11]:
# import pprint


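# NOTE: these positional indices are specific to this page's layout
html = list(soup.children)[2]       # the <html> element among the soup's children
body = list(html.children)[3]       # the <body> element among the <html> children
center = list(body.children)[4]     # the <center> element inside <body>
table = list(center.children)[0]    # the outer <table> inside <center>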

print(table.prettify())

<table border="0" cellpadding="1" cellspacing="2">
 <tr>
  <td align="center" valign="top" width="480">
   <table border="0" cellpadding="1" cellspacing="2">
    <tr>
     <td align="center" valign="top" width="480">
      <font size="5">
       <b>
        Love
       </b>
      </font>
     </td>
    </tr>
    <tr>
     <td align="left" valign="top" width="480">
      <font size="3">
       <b>
        Of the fourteen essays I'm writing, only this one treats an emotion. That love is the most important emotion is the deduction. I think other emotions may be as important, but are not so powerfully moving or interesting to most of us. Love is exciting. There is no need to justify choosing to write about it. Are not most songs love songs? Are not most novels stories featuring love?
       </b>
      </font>
     </td>
    </tr>
    <tr>
     <td align="left" valign="top" width="480">
      <font size="3">
       <b>
        Love in its broad sense is the feeling of strong attraction, and
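
Indexing into `list(...children)` depends on the exact whitespace and ordering of the page, so it breaks easily when the markup changes. As a hedged alternative, tags can be reached by name. The sketch below uses a made-up document that only loosely mirrors the body/center/table nesting above:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
sample_html = """
<html>
 <body>
  <center>
   <table><tr><td><b>Heading</b></td></tr></table>
  </center>
 </body>
</html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first descendant tag with that name.
sample_table = sample_soup.body.center.table

# find() with a tag name reaches the same element without spelling out the path.
same_table = sample_soup.find("table")

first_bold = sample_table.b.get_text()
```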

### .find() <br>
<p>Returns the first element that matches the tag name passed in, or None if nothing matches.</p>

In [12]:
table.find('b')

<b>Love</b>
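
A short sketch of how `.find()` behaves, using made-up HTML: it returns only the first match, can filter on attributes such as `class_`, and returns `None` when nothing matches, so the result is worth checking before use.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
sample_html = '<div><b class="title">Love</b><b>Second</b></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_b = sample_soup.find("b")                   # first <b> only
by_class = sample_soup.find("b", class_="title")  # filter on the class attribute
missing = sample_soup.find("table")               # no <table> here, so this is None

# Guard against None before calling methods on the result.
missing_text = missing.get_text() if missing is not None else ""
```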

### .find_all() <br>
<p>Similar to .find(), but returns every matching element (in a list-like ResultSet) instead of only the first.</p>

In [13]:
text_body = []

for b in table.find_all('b'):
    text_body.append(b.text)
    
text_body

['Love',
 "\xa0\xa0\xa0\xa0Of the fourteen essays I'm writing, only this one treats an emotion. That love is the most important emotion is the deduction. I think other emotions may be as important, but are not so powerfully moving or interesting to most of us. Love is exciting. There is no need to justify choosing to write about it. Are not most songs love songs? Are not most novels stories featuring love?",
 '\xa0\xa0\xa0\xa0Love in its broad sense is the feeling of strong attraction, and often attachment and protection. It is felt towards other people, towards pets, towards inanimate objects, towards abstractions such as patriotism, religious matters, hobbies, and I suppose nearly everything. It is multifaceted, and includes ordinary self-love, chivalrous love, carnal or sexual love, friendly love, family love. It is an emotion that is closely related to certain others, such as hope. At its simplest level it is what we strongly like.',
 "\xa0\xa0\xa0\xa0I have a hunch that love, like

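
The `\xa0` characters in the output are non-breaking spaces (`&nbsp;` in the HTML). One way to drop them, sketched here on made-up HTML, is to call `.get_text(strip=True)`, which strips leading and trailing whitespace from each string:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration; &nbsp; becomes \xa0 after parsing.
sample_html = (
    "<table>"
    "<tr><td><b>Love</b></td></tr>"
    "<tr><td><b>&nbsp;&nbsp;Of the fourteen essays</b></td></tr>"
    "</table>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# .text keeps the non-breaking spaces.
raw_texts = [b.text for b in sample_soup.find_all("b")]

# .get_text(strip=True) removes surrounding whitespace, including \xa0.
clean_texts = [b.get_text(strip=True) for b in sample_soup.find_all("b")]
```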
## Exercise <br>
<p>Using the Beautiful Soup library, grab the data from the following link: https://www.nbastuffer.com/2019-2020-nba-player-stats/. After getting the data, display each player's name and team in a pandas DataFrame.</p>

In [14]:
# Hint: Use the .get_text() method

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
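
As a starting point, the general pattern of turning table rows into a DataFrame can be sketched as follows. This is not a solution for the real page: the table layout, column order, and the placeholder names (`Player A`, `Team X`, and so on) are invented, and the actual site's markup would need to be inspected first.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder table: the names and the column layout are invented for this sketch.
sample_html = """
<table>
 <tr><th>NAME</th><th>TEAM</th></tr>
 <tr><td> Player A </td><td> Team X </td></tr>
 <tr><td> Player B </td><td> Team Y </td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in sample_soup.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 2:  # skip the header row, which only has <th> cells
        continue
    rows.append({
        "name": cells[0].get_text(strip=True),
        "team": cells[1].get_text(strip=True),
    })

players_df = pd.DataFrame(rows, columns=["name", "team"])
```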



## Selenium

In [15]:
import os
import sys
# Directory of the Python interpreter for the active environment
os.path.dirname(sys.executable)

'C:\\Users\\bstan\\anaconda3\\envs\\intro'

In [16]:
!pip install selenium



In [17]:
from selenium import webdriver
from time import sleep



In [18]:
from selenium.webdriver.common.keys import Keys

In [19]:
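from selenium.webdriver.common.by import By

# Requires a ChromeDriver build that matches the installed Chrome version
driver = webdriver.Chrome()
driver.get('https://kekambas-bs.herokuapp.com/login')

# Selenium 4 deprecated and later removed the find_element_by_* helpers;
# use find_element(By.<strategy>, value) instead
username = driver.find_element(By.NAME, 'username')
username.clear()
username.send_keys('bstanton')

password = driver.find_element(By.NAME, 'password')
password.clear()
password.send_keys('pass')
password.send_keys(Keys.RETURN)  # pressing Enter submits the login form

# Alternative: click the submit button instead of pressing Enter
# submit_btn = driver.find_element(By.ID, 'submit')
# submit_btn.click()

# Create a new post
driver.get('https://kekambas-bs.herokuapp.com/createpost')

title = driver.find_element(By.NAME, 'title')
title.clear()
title.send_keys('Doing another bot test')

content = driver.find_element(By.NAME, 'content')
content.clear()
content.send_keys('Beep boop bop, I am a robot')

submit_btn = driver.find_element(By.ID, 'submit')
submit_btn.click()

# Return to the home page and pause so the result is visible
driver.get('https://kekambas-bs.herokuapp.com')

sleep(7)

# End the browser session and release the driver process
driver.quit()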

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 91
Current browser version is 95.0.4638.69 with binary path C:\Program Files\Google\Chrome\Application\chrome.exe
