#### Day 4: Web Scraping + File I/O

##### Today's Topics:
1. Urllib and Beautiful Soup
2. Selenium
3. File Input/Output

* This is likely the most important day in the course (along with day05 on APIs).  
* You will use all the modules here if you want to scrape the internet.

***

### Part 1: Web Scraping (without APIs)

Web scraping is the art of extracting data from websites and delivering it in formats like JSON, CSV, HTML, PDF, etc.

Web scraping can be done either by using coding languages like Python, or by using data extraction APIs (Day 5).

##### Benefits 

1. Time-saving
2. Data accuracy
3. Cost-effective 


##### Ethics 

- Use a Public API when available and avoid scraping all together if the data you are looking if available through the API
- Only scrape when it is legal! 
    - NOT all sites can be legally scraped. Please don't get sued. 
    - Always check terms of service.
    - When in doubt, ask or don't do it. 
- Be polite and don't break websites
    - Scrape your data at a reasonable rate and control the number of requests per second. 
    - You don't want the website owner to think it as a DDoS attack. 


##### Overview of Web Scraping (without APIs)

1. Call the website and open it
2. Extract or load all the html code (you can store it locally for later use)
3. Retrieve information using the names of the tags, ids, etc. 
4. Store the data in to files (like csv)

##### 1.1 The Skeleton HTML Layout

In [None]:
# <!DOCTYPE html> <html>
# <head>
# <title> Page Title </title>
# </head>
# <body>

# <h1>My first heading </h1>
# <p>My first paragraph. </p>

# </body> 
# </html>

_See https://www.w3schools.com/tags/default.asp for a list wih HTML tags_

Now go to https://polisci.wustl.edu/people/88/ 

Click right, then View Page Source or (more likely) Inspect

##### 1.2 Web Crawlers
We mainly use two libraries: urllib and BeautifulSoup

1. urllib:
    - web crawler 
    - navigates to an url
2. BeautifulSoup
    - parses a downloaded HTML

Useful when:
- Info is contained in HTML (not served by JavaScript)
- Encoded HTML follows predictable pattern
- Example: https://www.presidency.ucsb.edu/documents/app-categories/presidential


Beautiful Soup documentation: 
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
# Run the line below if not installed alreay
# !pip3 install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup
import urllib.request

Example (WUSTL Political Science Webpage):

1. Open a web page

In [None]:
web_address = 'https://polisci.wustl.edu/people/88/'
web_page = urllib.request.urlopen(web_address)
web_page #stored on machine

2. Parse it

In [None]:
soup = BeautifulSoup(web_page.read())
# print(soup)
print(soup.prettify()) # enable us to view how tags are nested in the document

3. Find all cases of a certain tag 'a'

In [None]:
soup.find_all('a') # Returns a list... remember this!


4. Find all cases of a certain tag 'h3'

In [None]:
soup.find_all('h3')

5. Extract text from the tag

In [None]:
names = soup.find_all('h3') # list of html entries
[i.text for i in names] # grab just the text from each one

In [None]:
# We can get all elements with the tag 'a.' Then, get the attributes
all_a_tags = soup.find_all('a')
# all_a_tags
all_a_tags[36].attrs  # returns a dictionary with the attributes

In [None]:
all_a_tags[36]['href']

In [None]:
all_a_tags[36]['class']

In [None]:
for i in range(34,40):
  print(all_a_tags[i]['href'])

In [None]:
# Careful for the first and last tags
all_a_tags[0].attrs

In [None]:
# Note: because all_a_tags is a list, we need to index the element.
# If we are interested in the first instance of the tag 'a,' we can use
soup.find('a')

In [None]:
soup.find('a').attrs 

We can use a loop (for or while) to get all the data.

In [None]:
l = {"class" : [], "href" : []} # create a dictionary
for p in range(20,43):
    l["class"].append(all_a_tags[p].attrs["class"]) 
    l["href"].append(all_a_tags[p].attrs["href"]) 

print(l)

We can check all the attrs, using the `.keys()` method

In [None]:
all_a_tags[36].attrs.keys()

In [None]:
all_a_tags[36]['href']
# all_a_tags[36]['class']
# all_a_tags[1]['class']

In [None]:
# If we are interested only in the attributes 'class' and 'card' 
# nested within tag 'a', we can specify this in our first call:
soup.find_all('a', {'class' : "card"}) # returns a list

It is very common that you will need to go level by level to access nested tags.

Here is an example: 

In [None]:
sections = soup.find_all('div') # get all tags 'div'
len(sections) # check the size of the object

In [None]:
sections[2].a # FIRST 'a' tag within the 'div' tag or equivalently: 

In [None]:
sections[2].find('a') # FIRST 'a' tag within the 'div' tag

In [None]:
sections[2].find_all('a') ## ALL 'a' tags within the 'div' tag

In [None]:
sections[2].find_all('a', {'class' : 'first-level'}) ## ALL 'a' tags within the 'div' tag where 'class' is 'first-level'

We can create a tree of objects. Here is an example: 

Let's find Prof. Taylor Carlson's profile on the department website. 

1. Find all 'a' tags where 'class' is 'card'

In [None]:
all_people = soup.find_all('a', {'class' : "card"})
all_people

2. Manually examine where Prof. Carlson is located at. 

In [None]:
taylor = all_people[4]
taylor

3. Find the heading that contains Prof. Carlson's first and last name.

In [None]:
taylor.find_all('h3')
# taylor.find('h3').text

4. Check the contents contained within this 'a' tag for Prof. Carlson. 
Notice that this is basically the same output as above, but without the <a></a> tags. So it is returning everything nested within the 'a' tag.

In [None]:
taylor.contents

In [None]:
taylor.children # This is an iterator. Remember: iterators are objects that we use in loops

5. Print all nested elements within 'taylor'

In [None]:
for i, child in enumerate(taylor.children):
    print("Child %d: %s" % (i,child), '\n') # there is only one child element in this case

Let's now look at sibling tags of 'taylor'

In [None]:
# Siblings (Example):

# <html>
#   <body>
#       <a>
#         <b>
#          text1
#         </b>
#         <c>
#          text2
#         </c>
#       </a>
#   </body>
# </html>


# Which two tags are on the same level? 

In [None]:
for sib in taylor.next_siblings:
  print(sib)

In [None]:
# Or the previous instance
for sib in taylor.previous_siblings:
  print(sib)
# What is happening?

##### 1.3 Crawler Detection

Crawlers are incredibly fast, but also easier to detect and block. 

You can incorporate some pauses to avoid detection. 

1. Use random number generator to sleep for a random number of seconds
2. After each iteration, sleep for a fixed number of seconds

In [17]:
import random
import time

# Script will pause for a n seconds
time.sleep(random.uniform(1, 5))
print('Pause Ended')

Pause Ended


In [None]:
time.sleep(5)
print('done')

#### 1.4 Remote Driver

Selenium is a “remote driver” of your favorite browser. 

Therefore, you can pretty much simulate behavior of a human “surfing the web”. 

With the right tricks, the likelihood of tracking and blocking your “bot” decreases.

It also offers flexibility in terms of “unknown” items: you can even look by name of buttons in the page. 

There are some downsides though...
  - It is slower
  - It is dependent on your internet connection quality

Here is an example using Selenium: 

`pip3 install selenium` run this in terminal or command line if not installed

download appropriate web driver from browser, e.g. https://chromedriver.chromium.org/downloads


In [18]:
from selenium import webdriver
# from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
# from selenium.webdriver.common.keys import Keys

1. Give the path to your driver.

In [None]:
# Interactive example:
driver_path = Service('/Users/ysui/Desktop/PhD/MTE/pythoncamp2023_prep/Day04/Lecture/chromedriver')
driver = webdriver.Chrome(service = driver_path)

2. Start the web driver

In [None]:
driver.get('https://www.google.com')

3. Find the search element and enter text

In [None]:
search = driver.find_element("name", "q")
search.send_keys('WUSTL Political Science')

4. press Enter / Return (simulate this action using your driver)

In [None]:
search.submit()

5. Close the browser (make sure to always close your browser after web scraping)

In [None]:
driver.close()

#### Part 2: Combining Approaches

Let's combine the approaches and scrape some data from the Iceland Parliament! 

In [None]:
# Define Webpage
url = "https://www.althingi.is/altext/cv/en/"
# Use a crawler to get all pages for MPs
web_page = urllib.request.urlopen(url)
# Parse the HTML
soup = BeautifulSoup(web_page.read())#, "html.parser") # html.parser severs as a basis for parsing text files in HTML format
# Get all urls
mps = soup.find('table').find_all('a', href = True)
mps

In [None]:
# Create objects to store the data:
page = []
name = []
party = []
email = []

In [None]:
# run the function for the first 2 cases
for i in range(0, 2):
  print(i)
  page.append(url + mps[i]['href'])
  driver = webdriver.Chrome(service = driver_path)
  driver.get(page[i])
  html = driver.page_source
  driver.close()
  soup = BeautifulSoup(html)
  name.append(soup.find(class_ = 'article box news').find('h1').text)
  soup = soup.find(class_ = 'article box news').find('div', class_ = 'person')
  party.append(soup.find(class_ = 'office').find_all('li')[1].text)
  email.append(soup.find(class_ = 'contactinfo first notexternal').find('a', href = True)['href'].split(":")[1])
  # time.sleep(5)

##### Scraping Tips
- Google Chrome is better to track nodes and page sources
- Inspect the source and get to know your document/website!
- Selenium—Use the ’Copy Xpath’ command if you’re having troubles (Find it in "Inspect" in Google Chrome)
- Use time breaks to avoid being blocked and be polite
- Check the Terms of Service (whether you obey them or not). Please don't get sued. 


##### More on Selenium: https://selenium-python.readthedocs.io/locating-elements.html

### Part 2: Reading and Writing Files 

Reading Files
1. Import libraries

In [None]:
# import sys
import os

2. Set your working directory 

In [None]:
# pwd
os.chdir('/Users/ysui/Desktop/PhD/MTE/pythoncamp2023_prep/Day04/Lecture')

3. Read lines from the file

In [None]:
# Read all lines as one string
with open('readfile.txt') as f:
  the_whole_thing = f.read()
  print(the_whole_thing)

In [None]:
# Read line by line
with open('readfile.txt') as f:
  lines_list = f.readlines()
  for l in lines_list:
    print(l)

In [None]:
# More efficiently, we can loop over the file object (i.e. we don't need the variable lines)
with open('readfile.txt') as f:   
  for l in f:
    print(l)

In [None]:
# We can also manually open and close files
# I never do this
f =  open('readfile.txt')
print(f.read())
f.close()

Tips: 
- Try to minimize the number of times you open and close flies
- It is very expensive and consumes limited resources --> if too many, it leads to errors 

_Source: https://www.geeksforgeeks.org/context-manager-in-python/_


In [None]:
# file_descriptors = [] 
# for x in range(100000000000): 
#     file_descriptors.append(open('readfile.txt')) 

Writing Files
1. Writing files is easy, but be careful not to overwrite the content you actually want
2. See https://stackabuse.com/file-handling-in-python/ for more options

In [None]:
# We need to use the option 'w'
with open('test_writefile.txt', 'w') as f:
  ## wipes the file clean and opens it
  f.write("Hi guys.")
  f.write("Does this go on the second line?")
  f.writelines(['a\n', 'b\n', 'c\n'])

In [None]:
# We use 'a' to append new information to it
with open('test_writefile.txt', 'a') as f:
  f.write("I got appended!")

Writing csv files
1. Import csv

In [None]:
import csv

2. Open a file stream and create a `csv` writer object

In [None]:
# Open a file stream and create a CSV writer object
with open('test_writecsv.csv', 'w') as f:
  my_writer = csv.writer(f)
  for i in range(1, 100):
    my_writer.writerow([i, i-1])

3. Now read the `csv` file

In [12]:
# Now read in the csv
with open('test_writecsv.csv', 'r') as f:
  my_reader = csv.reader(f)
  mydat = []
  for row in my_reader:
    mydat.append(row)
print(mydat)

[['1', '0'], ['2', '1'], ['3', '2'], ['4', '3'], ['5', '4'], ['6', '5'], ['7', '6'], ['8', '7'], ['9', '8'], ['10', '9'], ['11', '10'], ['12', '11'], ['13', '12'], ['14', '13'], ['15', '14'], ['16', '15'], ['17', '16'], ['18', '17'], ['19', '18'], ['20', '19'], ['21', '20'], ['22', '21'], ['23', '22'], ['24', '23'], ['25', '24'], ['26', '25'], ['27', '26'], ['28', '27'], ['29', '28'], ['30', '29'], ['31', '30'], ['32', '31'], ['33', '32'], ['34', '33'], ['35', '34'], ['36', '35'], ['37', '36'], ['38', '37'], ['39', '38'], ['40', '39'], ['41', '40'], ['42', '41'], ['43', '42'], ['44', '43'], ['45', '44'], ['46', '45'], ['47', '46'], ['48', '47'], ['49', '48'], ['50', '49'], ['51', '50'], ['52', '51'], ['53', '52'], ['54', '53'], ['55', '54'], ['56', '55'], ['57', '56'], ['58', '57'], ['59', '58'], ['60', '59'], ['61', '60'], ['62', '61'], ['63', '62'], ['64', '63'], ['65', '64'], ['66', '65'], ['67', '66'], ['68', '67'], ['69', '68'], ['70', '69'], ['71', '70'], ['72', '71'], ['73', '72

3. Add column names

In [13]:
with open('test_csvfields.csv', 'w') as f:
  my_writer = csv.DictWriter(f, fieldnames = ("A", "B"))
  my_writer.writeheader()
  for i in range(1, 100):
    my_writer.writerow({"B":i, "A":i-1})

4. Read the new file

In [15]:
with open('test_csvfields.csv', 'r') as f:
  my_reader = csv.DictReader(f)
  for row in my_reader:
    print(row)

{'A': '0', 'B': '1'}
{'A': '1', 'B': '2'}
{'A': '2', 'B': '3'}
{'A': '3', 'B': '4'}
{'A': '4', 'B': '5'}
{'A': '5', 'B': '6'}
{'A': '6', 'B': '7'}
{'A': '7', 'B': '8'}
{'A': '8', 'B': '9'}
{'A': '9', 'B': '10'}
{'A': '10', 'B': '11'}
{'A': '11', 'B': '12'}
{'A': '12', 'B': '13'}
{'A': '13', 'B': '14'}
{'A': '14', 'B': '15'}
{'A': '15', 'B': '16'}
{'A': '16', 'B': '17'}
{'A': '17', 'B': '18'}
{'A': '18', 'B': '19'}
{'A': '19', 'B': '20'}
{'A': '20', 'B': '21'}
{'A': '21', 'B': '22'}
{'A': '22', 'B': '23'}
{'A': '23', 'B': '24'}
{'A': '24', 'B': '25'}
{'A': '25', 'B': '26'}
{'A': '26', 'B': '27'}
{'A': '27', 'B': '28'}
{'A': '28', 'B': '29'}
{'A': '29', 'B': '30'}
{'A': '30', 'B': '31'}
{'A': '31', 'B': '32'}
{'A': '32', 'B': '33'}
{'A': '33', 'B': '34'}
{'A': '34', 'B': '35'}
{'A': '35', 'B': '36'}
{'A': '36', 'B': '37'}
{'A': '37', 'B': '38'}
{'A': '38', 'B': '39'}
{'A': '39', 'B': '40'}
{'A': '40', 'B': '41'}
{'A': '41', 'B': '42'}
{'A': '42', 'B': '43'}
{'A': '43', 'B': '44'}
{'A': '

- Tip 1: We may find useful to save webpages for collecting data. (save to `.html` files)

In [22]:
import os

In [23]:
def download_page(address, filename, wait = 5):
  time.sleep(random.uniform(0,wait))
  page = urllib.request.urlopen(address)
  page_content = page.read()
  if os.path.exists(filename) == False:
    with open(filename, 'w') as p_html:
      p_html.write(str(page_content)) # needed to cast as string
  else:
    print("Can't overwrite file " + filename)

download_page('https://polisci.wustl.edu/people/88/', "polisci_ppl.html")

Then, we can parse a page that is already saved on your computer even without access to internet. 

In [26]:
with open('polisci_ppl.html') as f:
  myfile = f.read()
  soup = BeautifulSoup(myfile)
soup.prettify()

'<html>\n <body>\n  <p>\n   b\'\n   <!DOCTYPE html>\n  </p>\n  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n  \\n\n  <link href="https://polisci.wustl.edu/sites/all/themes/olympian/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>\n  \\n\n  <meta content="Drupal 7 (http://drupal.org)" name="generator"/>\n  \\n\n  <link href="https://polisci.wustl.edu/people/88" rel="canonical"/>\n  \\n\n  <link href="https://polisci.wustl.edu/people/88" rel="shortlink"/>\n  \\n\n  <meta content="Department of Political Science" property="og:site_name"/>\n  \\n\n  <meta content="article" property="og:type"/>\n  \\n\n  <meta content="https://polisci.wustl.edu/people/88" property="og:url"/>\n  \\n\n  <meta content="Faculty" property="og:title"/>\n  \\n\n  <meta content="Faculty" itemprop="name"/>\n  \\n\n  <title>\n   Faculty | Department of Political Science\n  </title>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <meta content=" De

- Tip 2: You may also write directly from a website to a `csv` file. This is good practice as it ensures a break 10 hours into the process does not erase all of your data. 
- Tip 3: Use Exception Handling techniques that we covered in Day03

In [27]:
with open('iceland_test.csv', 'w') as f: # set up with the writer
  w = csv.DictWriter(f, fieldnames = ("name", "party", "phone")) # define column names
  w.writeheader() # write the header
  web_address='https://www.althingi.is/altext/cv/en/' # the web address
  web_page = urllib.request.urlopen(web_address) # open the web page
  soup = BeautifulSoup(web_page.read()) # soup the web page
  all_members = soup.find_all('tr') # find the list of names and parties
  for i in range(1,3): # for members 1 and 2 (member 0 is just the table heading)
    # you should also add try/except language to ensure a weird item doesn't break your whole scraper
    try:
      member = {} ## empty dictionary to fill in
      member_i = all_members[i].find_all('td') # subset lower to each individual item
      member["name"] = member_i[0].text # member's name
      member['party'] =  member_i[1].text # member's party
      inner_page_url = web_address + member_i[0].a['href'] # get the extension to their personal page
      inner_page = urllib.request.urlopen(inner_page_url) # open the personal page
      inner_soup = BeautifulSoup(inner_page.read()) # soup the personal page
      member['phone'] = inner_soup.find('a', {'class' : 'tel'}).text # get phone number
    except:
      member['name'] = 'NA'
      member['party'] = 'NA'
      member['phone'] = 'NA'
    w.writerow(member) # write the row for this specific member
    time.sleep(random.uniform(1, 5)) # be polite, sleep!

In [None]:
# Copyright of the original version:

# Copyright (c) 2014 Matt Dickenson
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.