# Data Scraper with .csv output
Inspired by the YouTube tutorials by Alex The Analyst:

- [BeautifulSoup + Requests | Web Scraping in Python](https://www.youtube.com/watch?v=bargNl2WeN4)
- [Find and Find_All | Web Scraping in Python](https://www.youtube.com/watch?v=xjA1HjvmoMY)

## First try with requests and BeautifulSoup

### Import modules

In [30]:
# !pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

### Load one page with 'requests' from URL and check the document

In [31]:
url = "https://www.scrapethissite.com/pages/forms/"
response = requests.get(url)
#response.text
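
The cell above assumes the request succeeds. A minimal sketch of a more defensive fetch is shown below; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the response body of `url` as text (hypothetical helper)."""
    # Without a timeout, requests may wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://www.scrapethissite.com/pages/forms/")
```

Raising on a bad status code makes a failed download visible immediately instead of handing an error page to the parser.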

### Create the BeautifulSoup object and check the document

In [32]:
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly so the result does not depend on which parsers are installed
#soup

### Prettify the document data

In [33]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
  <link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
  <meta con

### Try to fetch the title

In [34]:
soup.find("title").text

'Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping'

### Try to read the pagination links

In [35]:
pagination = soup.find("ul", class_="pagination")
for anker in pagination.find_all("a"):
    #print(url + "?" + anker["href"].split("?")[1])
    ...
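
The commented-out line builds each page URL by splitting the `href` on `?`. A possible alternative, sketched here with only the standard library, is `urllib.parse.urljoin`, which resolves a link against the page URL. The `href` values below are hypothetical examples; the real ones should be read from the pagination links.

```python
from urllib.parse import urljoin


def absolute_page_urls(base_url, hrefs):
    """Resolve each href against base_url and drop duplicates, keeping order."""
    urls = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute not in urls:
            urls.append(absolute)
    return urls


base_url = "https://www.scrapethissite.com/pages/forms/"
# Hypothetical hrefs: two site-absolute paths, one duplicate, one query-only link
sample_hrefs = [
    "/pages/forms/?page_num=1",
    "/pages/forms/?page_num=2",
    "/pages/forms/?page_num=2",
    "?page_num=3",
]
page_urls = absolute_page_urls(base_url, sample_hrefs)
for page_url in page_urls:
    print(page_url)
```

Dropping duplicates by value also covers "next" and "previous" links that point at a page already in the list.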

### Try to read the table

In [36]:
table = soup.find("table", class_="table")
rows = table.find_all("tr")
#print(len(rows))

for row in rows:
    ths = row.find_all("th")
    tds = row.find_all("td")

    print([th.text.strip() for th in list(ths)])
    print("|".join([td.text.strip() for td in tds]))

['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses', 'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']

[]
Boston Bruins|1990|44|24||0.55|299|264|35
[]
Buffalo Sabres|1990|31|30||0.388|292|278|14
[]
Calgary Flames|1990|46|26||0.575|344|263|81
[]
Chicago Blackhawks|1990|49|23||0.613|284|211|73
[]
Detroit Red Wings|1990|34|38||0.425|273|298|-25
[]
Edmonton Oilers|1990|37|37||0.463|272|272|0
[]
Hartford Whalers|1990|31|38||0.388|238|276|-38
[]
Los Angeles Kings|1990|46|24||0.575|340|254|86
[]
Minnesota North Stars|1990|27|39||0.338|256|266|-10
[]
Montreal Canadiens|1990|39|30||0.487|273|249|24
[]
New Jersey Devils|1990|32|33||0.4|272|264|8
[]
New York Islanders|1990|25|45||0.312|223|290|-67
[]
New York Rangers|1990|36|31||0.45|297|265|32
[]
Philadelphia Flyers|1990|33|37||0.412|252|267|-15
[]
Pittsburgh Penguins|1990|41|33||0.512|342|305|37
[]
Quebec Nordiques|1990|16|50||0.2|236|354|-118
[]
St. Louis Blues|1990|47|22||0.588|310|250|60
[]
Toronto Maple Leafs|1990|23|46||0.287|241|
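
The loop above prints an empty list for every data row and an empty line for the header row, because each row contains either `th` or `td` cells, never both. A minimal sketch of separating the two cases follows. The inline HTML is a reduced stand-in for the real page (four columns, two teams) and the function name `table_to_records` is an assumption.

```python
from bs4 import BeautifulSoup

# Reduced sample that mimics the table structure; not the full page
sample_html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th><th>OT Losses</th></tr>
  <tr><td>Boston Bruins</td><td>1990</td><td>44</td><td></td></tr>
  <tr><td>Buffalo Sabres</td><td>1990</td><td>31</td><td></td></tr>
</table>
"""


def table_to_records(html):
    """Return (headers, records), with one dict per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table")
    headers, records = [], []
    for row in table.find_all("tr"):
        ths = row.find_all("th")
        if ths:  # header row: remember the column names
            headers = [th.get_text(strip=True) for th in ths]
            continue
        tds = row.find_all("td")
        if tds:  # data row: pair each cell with its column name
            values = [td.get_text(strip=True) for td in tds]
            records.append(dict(zip(headers, values)))
    return headers, records


headers, records = table_to_records(sample_html)
print(headers)
print(records[0])
```

Empty cells, such as "OT Losses" for the 1990 season, come back as empty strings.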

## Let's combine all the parts into our complete scraper

### Import the modules again

In [37]:
import requests
import time
from bs4 import BeautifulSoup
import pandas as pd

### Configure the URL and the output filename, then scrape every page

In [38]:
output_filename = "data/hockey.csv"

url = "https://www.scrapethissite.com/pages/forms/"
pagination = None

headers = []
data = []

while True:
    print("Load page:", url)
    response = requests.get(url)

    if response.status_code != 200:  # early exit on error
        print("Request failed with status code:", response.status_code)
        break

    soup = BeautifulSoup(response.text, "html.parser")  # Create the BeautifulSoup object with an explicit parser

    # should only run once at the first iteration
    if pagination is None:
        ul_pagination = soup.find("ul", class_="pagination")
        pagination = [url + "?" + anker["href"].split("?")[1] for anker in ul_pagination.find_all("a")][1:-1]

    # read table and rows
    table = soup.find("table", class_="table")
    rows = table.find_all("tr")

    # loop over all rows and extract headers (once) and data
    for row in rows:
        # keep the header cells only from the first header row seen
        ths = row.find_all("th")
        if ths and not headers:
            headers = [th.text.strip() for th in ths]
            continue

        # read all columns into a list and append it to data
        tds = row.find_all("td")
        if tds:  # skip rows without data cells (e.g. repeated header rows)
            data.append([td.text.strip() for td in tds])

    # load the next page
    if not pagination:
        print("No more pages! Have a nice day!")
        break
 
    print("Take a little nap!")
    time.sleep(1.5)  # the server owner allows only one request per second 
    url = pagination.pop(0)


# Create a Pandas DataFrame with the collected data and headers as column names
df = pd.DataFrame(data, columns=headers)

print(f"Write to CSV file ({output_filename})")
# to_csv does not create missing directories, so "data/" must already exist
df.to_csv(output_filename, index=False, header=True)

Load page: https://www.scrapethissite.com/pages/forms/
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=2
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=3
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=4
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=5
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=6
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=7
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=8
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=9
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=10
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=11
Take a little nap!
Load page: https://www.scrapethissite.com/pages/forms/?page_num=12
Take a 
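
The scraper stores every cell as a string. A minimal sketch of converting the numeric columns and making sure the output directory exists before writing is shown below. The two sample rows are taken from the table output earlier in this notebook, reduced to six columns. The temporary directory is an assumption to keep the example self-contained.

```python
from pathlib import Path
import tempfile

import pandas as pd

headers = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %"]
data = [
    ["Boston Bruins", "1990", "44", "24", "", "0.55"],
    ["Buffalo Sabres", "1990", "31", "30", "", "0.388"],
]

df = pd.DataFrame(data, columns=headers)

# Convert every column except the team name; empty strings become NaN
numeric_columns = [column for column in headers if column != "Team Name"]
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")

# Write below a temporary directory and create the "data" folder first
output_path = Path(tempfile.mkdtemp()) / "data" / "hockey.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

# Read the file back to confirm the round trip
reloaded = pd.read_csv(output_path)
print(reloaded.dtypes)
```

With `errors="coerce"`, a value that cannot be parsed turns into `NaN` instead of raising, which suits the empty "OT Losses" cells.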