# Data Science - Web Scraping

## Tasks Today:

1) <b>Requests</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Requests <br>
2) <b>Beautiful Soup</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Using Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) .prettify() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Converting to a List <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Extracting Beautiful Soup Elements <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Assigning Variables from Beautiful Soup <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) .find() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) .find_all() <br>
3) <b>Exercise</b> <br>

## Requests

### Importing

In [1]:
import requests

### Using Requests

In [2]:
page = requests.get('http://www.arthurleej.com/e-love.html')

In [3]:
# display result response
page

<Response [200]>

##### .content

In [None]:
page.content
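
The request above assumes the server answered successfully. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, its `timeout` default, and the injectable `get` parameter are illustrative assumptions that do not appear in this notebook; only `requests.get`, `raise_for_status()`, and `.content` are part of the Requests library.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the raw bytes of a page, raising an error on a bad HTTP status.

    `get` is a hypothetical hook so the function can be exercised without
    network access; by default it is simply requests.get.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    # .content is an attribute (bytes), not a method
    return response.content
```

Calling `fetch_html('http://www.arthurleej.com/e-love.html')` would be one way to replace the bare `requests.get` call, assuming the site is reachable.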

## Beautiful Soup

### Importing

In [None]:
from bs4 import BeautifulSoup

### Using Beautiful Soup

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
soup
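
To see what the `BeautifulSoup` object offers without depending on a live page, here is a small sketch that parses an inline document. The HTML string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up document standing in for page.content
sample_html = b"<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first tag with that name
title_text = sample_soup.title.string
paragraph_text = sample_soup.p.get_text()
print(title_text, paragraph_text)
```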

### .prettify()

In [None]:
# NOTE: .prettify() works on the full document and on any single tag (such as the result of .find()), but not on the list returned by .find_all()
print(soup.prettify())
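
As a rough illustration of where `.prettify()` can be called, the sketch below uses a made-up snippet. It formats a single tag directly and formats the items of a `.find_all()` result one at a time.

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

# A single tag returned by .find() can be prettified directly
first_p = sample_soup.find("p")
pretty_first = first_p.prettify()

# .find_all() returns a list-like result set, so prettify each tag in it
all_p = sample_soup.find_all("p")
pretty_all = [tag.prettify() for tag in all_p]
print(pretty_first)
```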

### Converting to a List

In [None]:
 # Tags may contain strings and other tags. These elements are the tag’s children.
print(list(soup.children))

html_children = list(soup.children)
html_children[3]

### Extracting Beautiful Soup Elements

In [None]:
# We can traverse an HTML page and extract other tags and text.
# The example below shows the types of the objects found among the document's top-level children.
# A Tag object lets us go deeper into the document: we can read its HTML attributes (such as class) and search its own children from there.
[type(item) for item in list(soup.children)]
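
The same idea can be tried on a small inline document. This is only a sketch with made-up HTML; it lists the type names of the top-level children and then keeps just the `Tag` objects, which are the ones that can be searched further.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample_soup = BeautifulSoup(
    "<!DOCTYPE html>\n<html><body><p>Hi</p></body></html>", "html.parser"
)

# Names of the classes of each top-level child
type_names = [type(child).__name__ for child in sample_soup.children]

# Only Tag objects have a tag name and children of their own
tag_children = [child for child in sample_soup.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
print(type_names, tag_names)
```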

### Assigning Variables from Beautiful Soup

In [None]:
import pprint

html = list(soup.children)[2]  # Select the <html> element from the soup object's children
body = list(html.children)[3]  # Select the <body> element from the <html> element's children
center = list(body.children)[4]  # Select a subset of the body
table = list(center.children)[0]  # Select a table from that subset

print(table)
# for element in list(center.children):
#     print('\n\n')
#     print(element)
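
Indexing into `list(...children)` depends on the exact whitespace and layout of the page, so the index numbers can shift if the page changes. One possible alternative, sketched here on made-up HTML, is to navigate by tag name instead.

```python
from bs4 import BeautifulSoup

# Made-up markup with the same nesting: html > body > center > table
sample_html = """
<html>
  <body>
    <center>
      <table><tr><td>cell</td></tr></table>
    </center>
  </body>
</html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

body = sample_soup.body        # first <body> tag in the document
center = body.find("center")   # first <center> inside the body
table = center.find("table")   # first <table> inside that <center>
first_cell = table.find("td").get_text()
print(first_cell)
```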

### .find() <br>
<p>Returns the first element that matches the argument passed in, or None if nothing matches</p>

In [None]:
soup.find('b')
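
Because `.find()` returns `None` when there is no match, it is usually worth checking the result before calling methods on it. A small sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<p>Plain <b>bold one</b> and <b>bold two</b></p>", "html.parser"
)

first_bold = sample_soup.find("b")    # only the first <b> is returned
missing = sample_soup.find("table")   # no <table> here, so this is None

# Guard against None before reading the text
bold_text = first_bold.get_text() if first_bold is not None else ""
print(bold_text, missing)
```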

### .find_all() <br>
<p>Similar to .find(), except it returns a list of every matching element instead of only the first</p>

In [None]:
soup.find_all(['td','b'])
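
The sketch below shows two common ways to call `.find_all()` on a made-up snippet: passing a list of tag names, and narrowing the match with `attrs`. The class names `name` and `score` are invented for the example.

```python
from bs4 import BeautifulSoup

sample_html = (
    '<table><tr><td class="name">Ann</td><td class="score">9</td></tr></table>'
    "<b>note</b>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A list of names matches any of those tags, in document order
mixed = sample_soup.find_all(["td", "b"])
mixed_names = [tag.name for tag in mixed]

# attrs restricts the match to tags carrying the given attribute value
scores = [tag.get_text() for tag in sample_soup.find_all("td", attrs={"class": "score"})]
print(mixed_names, scores)
```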

## Exercise <br>
<p>Using the Beautiful Soup library, grab the data from the following link: https://www.baseball-reference.com/teams/BOS/batteam.shtml. After getting the data, display only the year and batting average for each year (for example, 2017: .276). Lastly, plot the data on a matplotlib chart of your choice.</p>

In [None]:
# Hint: Use the .get_text() method

import matplotlib.pyplot as plt
import numpy as np
import collections

page = requests.get('https://www.baseball-reference.com/teams/BOS/batteam.shtml')
# print(page.status_code)

soup = BeautifulSoup(page.content, 'html.parser') # Grab the HTML

all_years_ba = soup.find_all('td', attrs={'data-stat': 'batting_avg'})
years = soup.find_all('th', attrs = {'data-stat':'year_ID'})


years.pop(0)

b_avg = {}
# Our output will be => {2019: .275}

# Add each year as a key and its batting average as the value in the b_avg dictionary
for index in range(len(all_years_ba)):
    year = int(years[index].get_text()) # .get_text() will get the textual data inside of the HTML tags
    avg = float(all_years_ba[index].get_text())
    b_avg[year] = avg
    if index == 20:
        break
        
print(b_avg)

b_avg_ordered = collections.OrderedDict(sorted(b_avg.items()))
print(b_avg_ordered)
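
The pairing logic can be tried offline before running it against the live site. The sketch below is an assumption-laden stand-in: the table markup is made up to mimic the `data-stat` attributes used above, and the two rows reuse the example values quoted in this notebook (2017: .276 and 2019: .275). Instead of `years.pop(0)`, it drops header cells by keeping only numeric year text.

```python
from bs4 import BeautifulSoup

# Made-up table that imitates the data-stat attributes used in the exercise
sample_html = """
<table>
  <tr><th data-stat="year_ID">Year</th><th data-stat="batting_avg">BA</th></tr>
  <tr><th data-stat="year_ID">2019</th><td data-stat="batting_avg">.275</td></tr>
  <tr><th data-stat="year_ID">2017</th><td data-stat="batting_avg">.276</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

year_cells = sample_soup.find_all("th", attrs={"data-stat": "year_ID"})
avg_cells = sample_soup.find_all("td", attrs={"data-stat": "batting_avg"})

# Keep only year cells whose text is numeric, which drops the header cell
year_values = [int(c.get_text()) for c in year_cells if c.get_text().strip().isdigit()]
avg_values = [float(c.get_text()) for c in avg_cells]

# zip pairs each year with its average; sorting orders the pairs by year
sample_avgs = dict(sorted(zip(year_values, avg_values)))
print(sample_avgs)
```

Since Python 3.7 a plain `dict` keeps insertion order, so building it from sorted pairs gives the same ordering that `collections.OrderedDict` provides above.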

In [None]:
# MLB averages

page2 = requests.get('https://www.baseball-reference.com/leagues/MLB/bat.shtml')
# print(page2.status_code)



In [None]:
plt.figure(figsize=(20,10))

plt.plot(list(b_avg.keys()), list(b_avg.values()), 'ro-', label='Boston Red Sox')
plt.xlabel('Year')
plt.ylabel('Batting Average')
plt.legend()
plt.title('Year-By-Year Team Batting Average')
plt.xticks(np.arange(min(b_avg), max(b_avg) + 1, 2.0))  # min/max of a dict operate on its keys (the years)

plt.show()

In [None]:
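# NOTE: bavg_mlb is never defined in this notebook; it must be built from page2 before this cell will run
print(bavg_mlb)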

# Bonus Example: Pulling Vegas Odds from PFR.com

<h3>Use this example for further reference</h3>
<p>This example shows what is returned when we access an HTML document with Beautiful Soup</p>

In [None]:
page = requests.get('https://www.pro-football-reference.com/boxscores/201810140nwe.htm')
# print(page.status_code)

soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
print(soup.prettify())

In [None]:
for section in list(soup.children):
    print(section)
    print('1\n2\n3\n')

In [None]:
html = list(soup.children)[3]

html

In [None]:
body = list(html.children)[3]

for el in list(body.children):
    print(el)
    print('\n\n\n\n123\n\n\n\n')

In [None]:
table = body.find_all('div')

print(table)

In [None]:
from bs4 import Comment

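# Collect every HTML comment node in the document
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for comment in comments:
    # Parse the comment text as HTML; name the parser explicitly to avoid a bs4 warning
    comment_soup = BeautifulSoup(str(comment), 'html.parser')
    log = comment_soup.find('table', {'id': 'game_info'})  # search as an ordinary tag
    if log:
        print(log)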
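
The technique used in this cell is to treat each HTML comment as its own small document and search inside it. The sketch below repeats that idea on a made-up snippet so it can be run offline; the `placeholder` cell value is invented and is not real odds data.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: one comment hides a table, the other is an ordinary note
sample_html = """
<div id="wrapper">
  <!-- <table id="game_info"><tr><th>Vegas Line</th><td>placeholder</td></tr></table> -->
  <!-- just a note -->
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

found_tables = []
for comment in sample_soup.find_all(string=lambda text: isinstance(text, Comment)):
    # Re-parse the comment body so the hidden markup becomes searchable tags
    inner = BeautifulSoup(str(comment), "html.parser")
    table = inner.find("table", {"id": "game_info"})
    if table is not None:
        found_tables.append(table)

cell_text = found_tables[0].find("td").get_text()
print(cell_text)
```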