# Web Scraping + File I/O

##### Today's Topics:
1. Urllib and Beautiful Soup
2. Selenium
3. File Input/Output

* This is likely the most important day in the course (along with day05 on APIs).  
* You will use all the modules here if you want to scrape the internet.

***

### 1. Web Scraping (without APIs)

* Web scraping is the art of extracting data from websites and delivering it in formats like JSON, CSV, HTML, PDF, etc.
* Web scraping can be done either with a programming language like Python or with data extraction APIs (Day 5).

##### Benefits 

1. Time-saving
2. Data accuracy
3. Cost-effective 

##### Ethics 

- Use a public API when one is available, and avoid scraping altogether if the data you are looking for is available through the API
- Only scrape when it is legal! 
    - NOT all sites can be legally scraped. Please don't get sued. 
    - Always check terms of service.
    - When in doubt, ask or don't do it. 
- Be polite and don't break websites
    - Scrape your data at a reasonable rate and control the number of requests per second. 
    - You don't want the website owner to mistake your scraper for a DDoS attack. 
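
One way to make the politeness rules above concrete is to consult a site's `robots.txt` before crawling. The sketch below uses the standard-library `urllib.robotparser`. The rules, the bot name `my-course-bot`, and the `example.com` URLs are made up for illustration; a real scraper would load the live file with `set_url()` and `read()` rather than parse a string. Keep in mind that `robots.txt` complements the terms of service and does not replace them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site)
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse the rules without touching the network

# Ask whether our (hypothetical) bot may fetch two example URLs
allowed = rp.can_fetch("my-course-bot", "https://example.com/people")
blocked = rp.can_fetch("my-course-bot", "https://example.com/private/data")
delay = rp.crawl_delay("my-course-bot")  # seconds the site asks us to wait

print(allowed, blocked, delay)
```

If `crawl_delay()` returns a number, it is a reasonable lower bound for the pause between requests.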

##### Overview of Web Scraping (without APIs)

1. Call the website and open it
2. Extract or load all the html code (you can store it locally for later use)
3. Retrieve information using the names of the tags, ids, etc. 
4. Store the data in files (like csv)

#### 1.1 The Skeleton HTML Layout

In [4]:
# <!DOCTYPE html> <html>
# <head>
# <title> Page Title </title>
# </head>
# <body>

# <h1>My first heading </h1>
# <p>My first paragraph. </p>

# </body> 
# </html>

_See https://www.w3schools.com/tags/default.asp for a list of HTML tags_
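
To see how the tags in this skeleton nest, here is a minimal sketch that uses only the standard-library `html.parser` module (not Beautiful Soup, which comes later). The class name `TagOutline` is a hypothetical helper introduced for illustration.

```python
from html.parser import HTMLParser

page = """<!DOCTYPE html>
<html>
<head><title>Page Title</title></head>
<body>
<h1>My first heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""

class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the closing tag is nested one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

parser = TagOutline()
parser.feed(page)
for depth, tag in parser.outline:
    print("  " * depth + tag)  # indent each tag by its depth
```

The indentation of the printed outline mirrors the structure we navigate later: `title` sits inside `head`, while `h1` and `p` sit side by side inside `body`.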

##### Let's look at some source code!

* Now go to https://polisci.wustl.edu/people/88/ 
* Right-click, then choose View Page Source or (more likely) Inspect


#### 1.2 Web Crawlers

##### We mainly use two libraries: `urllib` and `BeautifulSoup`

1. `urllib`:
    - web crawler 
    - navigates to a url
2. `BeautifulSoup`
    - parses a downloaded HTML


Useful when:
- Info is contained in HTML (not served by JavaScript)
- Encoded HTML follows predictable pattern
- Example: https://www.presidency.ucsb.edu/documents/app-categories/presidential

Beautiful Soup documentation: 
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

* Run the line below in a Jupyter cell if it is not installed already

In [5]:
# ! pip install beautifulsoup4

In [3]:
from bs4 import BeautifulSoup
import urllib.request

#### 1.3 Example (WUSTL Political Science Webpage):

1. Open a web page

In [47]:
#web_address = 'https://polisci.wustl.edu/people'
#web_page = urllib.request.urlopen(web_address)
#web_page #stored on machine

import ssl
web_address = 'https://polisci.wustl.edu/people'
context = ssl._create_unverified_context()  # skips certificate verification; a workaround for this demo only
web_page = urllib.request.urlopen(web_address, context=context)
web_page  # stored on machine

<http.client.HTTPResponse at 0x10f5f74f0>

* Try these alternative lines if the above didn't run

In [1]:
import requests
web_address = 'https://polisci.wustl.edu/people'
response = requests.get(web_address, verify=False)
web_page = response.text
web_page  # HTML content as string



'\n<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="og: https://ogp.me/ns#" class="no-js">\n  <head>\n    <meta charset="utf-8" />\n<noscript><style>form.antibot * :not(.antibot-message) { display: none !important; }</style>\n</noscript><style>/* @see https://github.com/aFarkas/lazysizes#broken-image-symbol */.js img.lazyload:not([src]) { visibility: hidden; }/* @see https://github.com/aFarkas/lazysizes#automatically-setting-the-sizes-attribute */.js img.lazyloaded[data-sizes=auto] { display: block; width: 100%; }/* Transition effect. */.js .lazyload, .js .lazyloading { opacity: 0; }.js .lazyloaded { opacity: 1; -webkit-transition: opacity 2000ms; transition: opacity 2000ms; }</style>\n<meta name="description" content="Description" />\n<link rel="shortlink" href="https://polisci.wustl.edu/node/50" />\n<link rel="canonical" href="https://polisci.wustl.edu/people" />\n<meta name="robots" content="index, follow" />\n<meta name="author" content="asdrupal" />\n<meta name="generator" conte

2. Parse it

In [5]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(web_page, 'html.parser')
# print(soup)
# print(soup.prettify()) # enable us to view how tags are nested in the document

In [6]:
 str(soup.prettify()) # enable us to view how tags are nested in the document

'<!DOCTYPE html>\n<html class="no-js" dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">\n <head>\n  <meta charset="utf-8"/>\n  <noscript>\n   <style>\n    form.antibot * :not(.antibot-message) { display: none !important; }\n   </style>\n  </noscript>\n  <style>\n   /* @see https://github.com/aFarkas/lazysizes#broken-image-symbol */.js img.lazyload:not([src]) { visibility: hidden; }/* @see https://github.com/aFarkas/lazysizes#automatically-setting-the-sizes-attribute */.js img.lazyloaded[data-sizes=auto] { display: block; width: 100%; }/* Transition effect. */.js .lazyload, .js .lazyloading { opacity: 0; }.js .lazyloaded { opacity: 1; -webkit-transition: opacity 2000ms; transition: opacity 2000ms; }\n  </style>\n  <meta content="Description" name="description"/>\n  <link href="https://polisci.wustl.edu/node/50" rel="shortlink"/>\n  <link href="https://polisci.wustl.edu/people" rel="canonical"/>\n  <meta content="index, follow" name="robots"/>\n  <meta content="asdrupal" name="author"

In [7]:
str(soup.prettify())[0:1500]

'<!DOCTYPE html>\n<html class="no-js" dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">\n <head>\n  <meta charset="utf-8"/>\n  <noscript>\n   <style>\n    form.antibot * :not(.antibot-message) { display: none !important; }\n   </style>\n  </noscript>\n  <style>\n   /* @see https://github.com/aFarkas/lazysizes#broken-image-symbol */.js img.lazyload:not([src]) { visibility: hidden; }/* @see https://github.com/aFarkas/lazysizes#automatically-setting-the-sizes-attribute */.js img.lazyloaded[data-sizes=auto] { display: block; width: 100%; }/* Transition effect. */.js .lazyload, .js .lazyloading { opacity: 0; }.js .lazyloaded { opacity: 1; -webkit-transition: opacity 2000ms; transition: opacity 2000ms; }\n  </style>\n  <meta content="Description" name="description"/>\n  <link href="https://polisci.wustl.edu/node/50" rel="shortlink"/>\n  <link href="https://polisci.wustl.edu/people" rel="canonical"/>\n  <meta content="index, follow" name="robots"/>\n  <meta content="asdrupal" name="author"

3. Find all cases of a certain tag 'a'

In [8]:
soup.find_all('a')[2:10] # Returns a list... remember this!

[<a alt=" Department of Political Science  front page" href="/"> Department of Political Science </a>,
 <a data-drupal-link-system-path="node/13357" href="/undergraduate">Undergraduate Program</a>,
 <a data-drupal-link-system-path="node/13340" href="/graduate-program">Graduate Program</a>,
 <a data-drupal-link-system-path="node/14467" href="/masters-degree-statistics-political-science-phd-students">Master’s Degree in Statistics for Political Science Ph.D. Students</a>,
 <a data-drupal-link-system-path="node/13198" href="/research">Research</a>,
 <a aria-current="page" class="is-active" data-drupal-link-system-path="node/50" href="/people">Our People</a>,
 <a data-drupal-link-system-path="node/61" href="/resources">Resources</a>,
 <a href="https://gifts.wustl.edu/index.html?other_designation_description=Political%20Science">Make a Gift to Political Science</a>]

4. Find all cases of a certain tag `<h3>`

In [9]:
soup.find_all('h3')[2:12]

[<h3>
 <div><span>Federico</span></div>
 <div><span>Acosta y Lara</span></div>
 </h3>,
 <h3>
 <div><span>Deniz</span></div>
 <div><span>Aksoy</span></div>
 </h3>,
 <h3>
 <div><span>Lukas</span></div>
 <div><span>Alexander</span></div>
 </h3>,
 <h3>
 <div><span>Alex</span></div>
 <div><span>Avery</span></div>
 </h3>,
 <h3>
 <div><span>Timm</span></div>
 <div><span>Betz</span></div>
 </h3>,
 <h3>
 <div><span>Zachary</span></div>
 <div><span>Bowersox</span></div>
 </h3>,
 <h3>
 <div><span>Christina L.</span></div>
 <div><span>Boyd</span></div>
 </h3>,
 <h3>
 <div><span>Ryan</span></div>
 <div><span>Burge</span></div>
 </h3>,
 <h3>
 <div><span>Daniel </span></div>
 <div><span>Butler</span></div>
 </h3>,
 <h3>
 <div><span>Anthony</span></div>
 <div><span>Buzzacco</span></div>
 </h3>]

5. Extract text from the tag

In [10]:
names = soup.find_all('h3') # list of html entries
[i.text for i in names][2:10] # grab just the text from each one

['\nFederico\nAcosta y Lara\n',
 '\nDeniz\nAksoy\n',
 '\nLukas\nAlexander\n',
 '\nAlex\nAvery\n',
 '\nTimm\nBetz\n',
 '\nZachary\nBowersox\n',
 '\nChristina L.\nBoyd\n',
 '\nRyan\nBurge\n']
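
The extracted strings still carry the newlines that separated the nested `<div>` tags. A minimal sketch of one way to tidy them, using three of the raw strings from the output above and plain string methods:

```python
# Raw strings copied from the output above
raw_names = ['\nFederico\nAcosta y Lara\n', '\nDeniz\nAksoy\n', '\nChristina L.\nBoyd\n']

# split() with no argument breaks on any run of whitespace (including newlines);
# joining with a single space gives "First Last"
clean_names = [" ".join(raw.split()) for raw in raw_names]
print(clean_names)
```

Beautiful Soup can do something similar in one step with `tag.get_text(" ", strip=True)`, which strips each piece of text and joins the pieces with the given separator.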

* We can create an object containing all elements with the tag `<a>`. Then, get the attributes

In [11]:
all_a_tags = soup.find_all('a')
# all_a_tags
all_a_tags[36].attrs  # returns a dictionary with the attributes

{'href': '/people/dino-p-christenson',
 'id': 'faculty-card-container',
 'class': ['card'],
 'aria-label': 'View Dino P. Christenson'}

* Access the attributes with key-value syntax

In [12]:
all_a_tags[36].attrs.keys()

dict_keys(['href', 'id', 'class', 'aria-label'])

In [13]:
all_a_tags[36]['href']

'/people/dino-p-christenson'

In [14]:
all_a_tags[36]['class']

['card']

In [15]:
for i in range(34,40):
  print(all_a_tags[i]['href'])

/people/amaan-charaniya
/people/tian-chen
/people/dino-p-christenson
/people/leticia-claro-oliveira
/people/brian-f-crisp
/people/bowen-damask
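
Most of these `href` values are relative paths, so they need the site's base address before they can be opened. A small sketch using the standard-library `urljoin`; the list mixes two relative paths from the output above with one absolute link of the kind that also occurs on this page:

```python
from urllib.parse import urljoin

base_url = "https://polisci.wustl.edu/people"
hrefs = [
    "/people/tian-chen",
    "/people/dino-p-christenson",
    "https://rap.wustl.edu/people/ryan-burge/",  # already absolute
]

# urljoin resolves relative paths against the base and leaves absolute URLs untouched
full_urls = [urljoin(base_url, h) for h in hrefs]
for url in full_urls:
    print(url)
```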


##### Some notes

*  Be careful with the first and last tags; they often differ from the others

In [16]:
all_a_tags[0].attrs

{'href': '#main-content',
 'class': ['visually-hidden', 'focusable', 'skip-link'],
 'role': 'link',
 'aria-label': 'skip to main content'}

* Because `all_a_tags` is a list, we need to index the element(s) we're interested in
* If we are interested in the first instance of the tag `<a>`, we can use

In [17]:
soup.find('a')

<a aria-label="skip to main content" class="visually-hidden focusable skip-link" href="#main-content" role="link">
    Skip to main content
  </a>

In [18]:
soup.find('a').attrs 

{'href': '#main-content',
 'class': ['visually-hidden', 'focusable', 'skip-link'],
 'role': 'link',
 'aria-label': 'skip to main content'}

We can use a loop (for or while) to get and re-organize all the data.

In [19]:
l = {"class" : [], "href" : []} # create a dictionary
for p in range(20,43):
    l["class"].append(all_a_tags[p].attrs["class"]) 
    l["href"].append(all_a_tags[p].attrs["href"]) 

print(l)

{'class': [['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card'], ['card']], 'href': ['/people/deniz-aksoy', '/people/lukas-alexander', '/people/alex-avery', '/people/timm-betz', '/people/zachary-bowersox', '/people/christina-l-boyd', 'https://rap.wustl.edu/people/ryan-burge/', '/people/daniel-butler', '/people/anthony-buzzacco', '/people/randall-calvert', '/people/michael-cannon', '/people/taylor-carlson', '/people/david-carter', '/people/zeynep-ceren-topac', '/people/amaan-charaniya', '/people/tian-chen', '/people/dino-p-christenson', '/people/leticia-claro-oliveira', '/people/brian-f-crisp', '/people/bowen-damask', '/people/niko-dawson', '/people/rex-weiye-deng', '/people/juan-dodyk']}
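
The loop above works because every tag in that range has both a `class` and an `href`. Indexing `attrs["class"]` raises a `KeyError` for a tag that lacks the attribute, so a more defensive sketch uses `.get()`. The dictionaries below are stand-ins modeled on the `.attrs` outputs shown earlier, not a live scrape.

```python
# Stand-ins for tag.attrs dictionaries; the last one has no "class" key
tag_attrs = [
    {"href": "/people/deniz-aksoy", "class": ["card"]},
    {"href": "#main-content", "class": ["visually-hidden", "focusable", "skip-link"]},
    {"href": "/research", "data-drupal-link-system-path": "node/13198"},
]

links = {"class": [], "href": []}
for attrs in tag_attrs:
    links["class"].append(attrs.get("class", []))  # empty list when the tag has no class
    links["href"].append(attrs.get("href"))        # None when the tag has no href

print(links)
```

Beautiful Soup tags offer the same idea directly: `tag.get("class")` returns `None` instead of raising an error when the attribute is missing.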


##### If we are interested only in `<a>` tags whose `class` is `card`, we can specify this in our initial `find_all()` call:

In [20]:
soup.find_all('a', {'class' : "card"})[0:2] # returns a list

[<a aria-label="View Federico Acosta y Lara" class="card" href="/people/federico-acosta-y-lara" id="faculty-card-container">
 <article class="faculty-post">
 <div class="image">
 <img alt="Federico Acosta y Lara" height="320" loading="lazy" sizes="(max-width:480px) 85vw, 290px" src="/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/People/Polisci_Acosta%20Y%20Lara_F-1010531_0.jpg.webp" srcset="https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_190w/public/People/Polisci_Acosta%20Y%20Lara_F-1010531_0.jpg.webp 190w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/People/Polisci_Acosta%20Y%20Lara_F-1010531_0.jpg.webp 320w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_480w/public/People/Polisci_Acosta%20Y%20Lara_F-1010531_0.jpg.webp 480w" width="320"/>
 <h3>
 <div><span>Federico</span></div>
 <div><span>Acosta y Lara</span></div>
 </h3>
 </div>
 <div class="dept">
                          

##### Commonly, you will need to go level by level in an exploratory exercise to access nested tags. Here is an example:

* First get all tags `<div>`

In [21]:
sections = soup.find_all('div') 
len(sections) # check the size of the object

268

* View the FIRST `<a>` tag within the first valid `<div>` tag 

In [22]:
sections[2].a

<a class="logo" href="https://artsci.washu.edu">
<picture>
<source media="(min-width: 767px)" srcset="/themes/custom/olympian9/svg/logo-art-sci-vert-color.svg"/>
<img alt="Washington University in St. Louis" src="/themes/custom/olympian9/svg/logo-art-sci-color.svg"/>
</picture>
</a>

* Or, equivalently:

In [23]:
sections[2].find('a') 

<a class="logo" href="https://artsci.washu.edu">
<picture>
<source media="(min-width: 767px)" srcset="/themes/custom/olympian9/svg/logo-art-sci-vert-color.svg"/>
<img alt="Washington University in St. Louis" src="/themes/custom/olympian9/svg/logo-art-sci-color.svg"/>
</picture>
</a>

* This gives us ALL `<a>` tags within the first valid `<div>` tag 

In [24]:
sections[2].find_all('a')[0:10] 

[<a class="logo" href="https://artsci.washu.edu">
 <picture>
 <source media="(min-width: 767px)" srcset="/themes/custom/olympian9/svg/logo-art-sci-vert-color.svg"/>
 <img alt="Washington University in St. Louis" src="/themes/custom/olympian9/svg/logo-art-sci-color.svg"/>
 </picture>
 </a>,
 <a alt=" Department of Political Science  front page" href="/"> Department of Political Science </a>,
 <a data-drupal-link-system-path="node/13357" href="/undergraduate">Undergraduate Program</a>,
 <a data-drupal-link-system-path="node/13340" href="/graduate-program">Graduate Program</a>,
 <a data-drupal-link-system-path="node/14467" href="/masters-degree-statistics-political-science-phd-students">Master’s Degree in Statistics for Political Science Ph.D. Students</a>,
 <a data-drupal-link-system-path="node/13198" href="/research">Research</a>,
 <a aria-current="page" class="is-active" data-drupal-link-system-path="node/50" href="/people">Our People</a>,
 <a data-drupal-link-system-path="node/61" hre

* This gives us ALL `<a>` tags within the first valid `<div>` tag where `class` is 'first-level'

In [25]:
sections[2].find_all('a', {'class' : 'first-level'}) 

[]

##### We can also create a tree of objects. Here is an example: 

Let's find Prof. Taylor Carlson's profile on the department website. 
1. Find the `<a>` tag whose `aria-label` contains 'Taylor Carlson'

In [26]:
taylor = soup.find("a", {"aria-label": lambda x: x and "Taylor Carlson" in x})
print(taylor)



<a aria-label="View Taylor Carlson" class="card" href="/people/taylor-carlson" id="faculty-card-container">
<article class="faculty-post">
<div class="image">
<img alt="Taylor Carlson" height="320" loading="lazy" sizes="(max-width:480px) 85vw, 290px" src="/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp" srcset="https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_190w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 190w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 320w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_480w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 480w" width="320"/>
<h3>
<div><span>Taylor</span></div>
<div><span>Carlson</span></div>
</h3>
</div>
<div class="dept">
                                Director of Graduate Studies Political Science
                                  Associate Professor

2. Alternatively, examine the list of cards manually to find Prof. Carlson's position. 

In [27]:
# all_people = soup.find_all('a', {'class' : "card"})
# taylor = all_people[6]
# taylor

3. Find the heading that contains Prof. Carlson's first and last name.

In [28]:
taylor.find_all('h3')
# taylor.find('h3').text

[<h3>
 <div><span>Taylor</span></div>
 <div><span>Carlson</span></div>
 </h3>]

4. Check the contents contained within this `<a>` tag for Prof. Carlson. 
Notice that this is basically the same output as above, but without the `<a></a>` tags. So it is returning everything nested within the 'a' tag.

In [29]:
taylor.contents

['\n',
 <article class="faculty-post">
 <div class="image">
 <img alt="Taylor Carlson" height="320" loading="lazy" sizes="(max-width:480px) 85vw, 290px" src="/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp" srcset="https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_190w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 190w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 320w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_480w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 480w" width="320"/>
 <h3>
 <div><span>Taylor</span></div>
 <div><span>Carlson</span></div>
 </h3>
 </div>
 <div class="dept">
                                 Director of Graduate Studies Political Science
                                   Associate Professor of Political Science<br>
 Weidenbaum Center Director of Survey Research
                 

* This is an iterator
* Remember: iterators are objects that we access with loops

In [30]:
taylor.children

<generator object Tag.children.<locals>.<genexpr> at 0x114d61900>

5. Print all nested elements within 'taylor'

In [31]:
# whitespace strings count as children too, so the <article> tag is not the only item
for i, child in enumerate(taylor.children):
    print("Child %d: %s" % (i,child), '\n') 

Child 0: 
 

Child 1: <article class="faculty-post">
<div class="image">
<img alt="Taylor Carlson" height="320" loading="lazy" sizes="(max-width:480px) 85vw, 290px" src="/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp" srcset="https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_190w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 190w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_320w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 320w, https://polisci.wustl.edu/sites/polisci.wustl.edu/files/styles/square_1_1_480w/public/PoliSci_Carlson_T_P1322772_0.jpg.webp 480w" width="320"/>
<h3>
<div><span>Taylor</span></div>
<div><span>Carlson</span></div>
</h3>
</div>
<div class="dept">
                                Director of Graduate Studies Political Science
                                  Associate Professor of Political Science<br>
Weidenbaum Center Director of Survey Research
              

##### Let's now look at sibling tags of `taylor`

In [32]:
# Siblings (Example):

# <html>
#   <body>
#       <a>
#         <b>
#          text1
#         </b>
#         <c>
#          text2
#         </c>
#       </a>
#   </body>
# </html>


# Which two tags are on the same level? 

* See siblings *after* `taylor` in the sequence of `<a>` tags

In [33]:
from bs4 import NavigableString


siblings = [s for s in taylor.next_siblings if not isinstance(s, NavigableString)]
for sib in siblings[:2]:
    print(sib.name, sib.get("aria-label"), sib.get("href"))

a View David Carter /people/david-carter
a View Zeynep Ceren Topac /people/zeynep-ceren-topac


* Or see siblings *before* `taylor` in the sequence of `<a>` tags

In [34]:
siblings2 = [s for s in taylor.previous_siblings if not isinstance(s, NavigableString)]
for sib in [siblings2[3], siblings2[5]]:
    print(sib.name, sib.get("aria-label"), sib.get("href"))

a View Daniel  Butler /people/daniel-butler
a View Christina L. Boyd /people/christina-l-boyd
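
To answer the question posed in the toy example above, here is a minimal sketch that parses the same structure and asks Beautiful Soup for the siblings. The variable names `toy_html` and `toy_soup` are introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Same structure as the commented example: <b> and <c> are both nested inside <a>
toy_html = "<html><body><a><b>text1</b><c>text2</c></a></body></html>"
toy_soup = BeautifulSoup(toy_html, "html.parser")

b_tag = toy_soup.find("b")
c_tag = toy_soup.find("c")

next_tag = b_tag.find_next_sibling()      # next tag on the same level (strings are skipped)
prev_tag = c_tag.find_previous_sibling()  # previous tag on the same level

print(next_tag.name, prev_tag.name, b_tag.parent.name)
```

`<b>` and `<c>` are siblings because they share the parent `<a>`. `find_next_sibling()` and `find_previous_sibling()` return tags only, which avoids the `NavigableString` filtering used above.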


#### 1.4 Crawler Detection

##### Crawlers are incredibly fast, but also easier to detect and block. 

You can incorporate some pauses to avoid detection. Strategies include: 

1. Using a random number generator to sleep for a random number of seconds
2. After each iteration, sleep for a fixed number of seconds

##### Import module `random` to generate random numbers, and module `time` to control the pauses in your code

* Random-second pause approach

In [35]:
import random
import time

# Script will pause for a random duration between 1 and 5 seconds
time.sleep(random.uniform(1, 5))
print('Pause Ended')

Pause Ended


* Fixed-second pause approach

In [36]:
time.sleep(5)
print('done')

done
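
In a real scraper the pause belongs inside the loop that requests the pages. A minimal sketch, assuming a hypothetical helper `polite_pause` and placeholder paths; the delays are tiny here only so the demo finishes quickly:

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=5.0):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

pages = ["/people/deniz-aksoy", "/people/timm-betz"]  # placeholder paths
delays = []
for path in pages:
    # ... download and parse the page for `path` here ...
    delays.append(polite_pause(0.01, 0.05))  # use e.g. (1, 5) in a real run

print(delays)
```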


#### 1.5 Remote Drivers

##### Selenium is a “remote driver” of your favorite browser. 

* You can pretty much simulate behavior of a human “surfing the web”. 
* With the right tricks, the likelihood of tracking and blocking your “bot” decreases.
* It also offers flexibility with “unknown” items: you can even locate elements, such as buttons, by their name on the page. 

##### There are some downsides though...
  - It is slower
  - It is dependent on your internet connection quality

##### Let's walk through an example using Selenium

* If you haven't already, make sure to install `Selenium` by running
    * `pip install selenium` in terminal or command line, or
    * `!pip install selenium` in a Jupyter notebook cell

* Download the appropriate web driver for your browser, e.g. https://chromedriver.chromium.org/downloads


In [65]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys

1. Give the path to your driver.

In [66]:
import os

In [67]:
os.getcwd()

'/Users/claro/Desktop/PythonCamp2025/Day02/Day02_Part02/Lecture'

In [68]:
# # Interactive example:
# driver_path = Service('/Users/almavelazquez/Documents/GitHub/PythonCamp2024/Day04/Lecture/chromedriver')
# # if on Windows, may need to add '.exe' at the end of path
# driver = webdriver.Chrome(service = driver_path)

import time 
driver = webdriver.Chrome()

2. Start the web driver

In [69]:
driver.get('https://www.google.com')
time.sleep(2) 

3. Find the search element and enter text

In [70]:
search = driver.find_element("name", "q")
search.send_keys('WUSTL Political Science')
time.sleep(2)
# just go to the target site: https://polisci.wustl.edu/people/ 
# or use bing (less strict)
# or use google custom search 
# or.. at the risk of being blocked you could use undetected-chromedriver

4. press Enter / Return (simulate this action using your driver)

In [71]:
search.submit()
time.sleep(4) 

5. Close the browser (make sure to always close your browser after web scraping)

In [72]:
driver.quit()  # quit() ends the session and closes every window; close() only closes the current window

##### Scraping Tips
- Google Chrome is better to track nodes and page sources
- Inspect the source and get to know your document/website!
- In Selenium, use the 'Copy XPath' command if you're having trouble locating an element (find it under "Inspect" in Google Chrome)
- Use time breaks to avoid being blocked and be polite
- Check the Terms of Service (whether you obey them or not). Please don't get sued. 


##### More on Selenium: https://selenium-python.readthedocs.io/locating-elements.html

### 2. Reading and Writing Files 

#### 2.1 Reading Files

1. Import libraries

In [12]:
import os

2. View your working directory

In [13]:
os.getcwd()
import sys


3. Set your working directory 

In [19]:
os.chdir('/Users/claro/Desktop/PythonCamp2025/Day02/Day02_Part02/Lecture')

4. Read lines from the file

In [20]:
# Read all lines as one string
with open('readfile.txt') as f:
  the_whole_thing = f.read()
  print(the_whole_thing)

Here is line 1.
Here is line 2.
This is the final line.


In [21]:
# Read line by line
with open('readfile.txt') as f:
  lines_list = f.readlines()
  for l in lines_list:
    print(l)

Here is line 1.

Here is line 2.

This is the final line.


In [22]:
# More efficiently, we can loop over the file object (i.e. we don't need the variable lines_list)
with open('readfile.txt') as f:   
  for l in f:
    print(l)

Here is line 1.

Here is line 2.

This is the final line.
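
The blank lines in the two outputs above appear because every line read from the file still ends in `"\n"`, and `print()` adds a second newline. A short sketch of stripping it, using a temporary stand-in for `readfile.txt` so it runs anywhere:

```python
import os
import tempfile

# Recreate a small file like readfile.txt in a temporary folder
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "readfile.txt")
with open(path, "w") as f:
    f.write("Here is line 1.\nHere is line 2.\nThis is the final line.\n")

# rstrip("\n") removes the trailing newline that each line carries
with open(path) as f:
    stripped = [line.rstrip("\n") for line in f]

for line in stripped:
    print(line)  # no blank lines in between now
```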


In [23]:
# We can also manually open and close files
# I never do this
f =  open('readfile.txt')
print(f.read())
f.close()

Here is line 1.
Here is line 2.
This is the final line.


Tips: 
- Try to minimize the number of times you open and close files
- Each open file consumes limited system resources, so leaving too many open at once leads to errors 

_Source: https://www.geeksforgeeks.org/context-manager-in-python/_


In [24]:
file_descriptors = []
for x in range(100000000000):
    file_descriptors.append(open('readfile.txt'))

OSError: [Errno 24] Too many open files: 'readfile.txt'
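
The error above comes from opening files without ever closing them. For contrast, a minimal sketch showing that a `with` block closes the file as soon as the block ends; the temporary file is a stand-in for `readfile.txt`:

```python
import os
import tempfile

# Temporary stand-in for readfile.txt
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "readfile.txt")
with open(path, "w") as f:
    f.write("Here is line 1.\n")

handles = []
for _ in range(3):
    with open(path) as f:  # the file is closed automatically when this block ends
        f.read()
    handles.append(f)      # keep the (now closed) file object just to inspect it

all_closed = all(h.closed for h in handles)
print(all_closed)
```

Because each handle is released before the next one is opened, the loop never holds more than one open file at a time.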

#### 2.2 Writing Files

1. Writing files is easy, but be careful not to overwrite the content you actually want
2. See https://stackabuse.com/file-handling-in-python/ for more options

* We need to use the option 'w'

In [29]:
with open('test_writefile.txt', 'w') as f:
    f.write("Hi guys.\n")
    f.write("Does this go on the second line?\n")
    f.writelines(['a\n', 'b\n', 'c\n'])

OSError: [Errno 24] Too many open files: 'test_writefile.txt'

In [None]:
# We use 'a' to append new information to it
with open('test_writefile.txt', 'a') as f:
  f.write("I got appended!")

##### Writing CSV files (pre-pandas)

1. Import csv

In [None]:
import csv

2. Open a file stream and create a `csv` writer object

In [None]:
# Open a file stream and create a CSV writer object
with open('test_writecsv.csv', 'w', newline='') as f:  # newline='' prevents blank rows on Windows
  my_writer = csv.writer(f)
  for i in range(1, 100):
    my_writer.writerow([i, i-1])

3. Now read the `csv` file

In [None]:
with open('test_writecsv.csv', 'r') as f:
  my_reader = csv.reader(f)
  mydat = []
  for row in my_reader:
    mydat.append(row)
print(mydat[0],"\n", mydat[1],"\n", mydat[2],"\n", mydat[3])

['1', '0'] 
 ['2', '1'] 
 ['3', '2'] 
 ['4', '3']


4. Add column names 

In [None]:
# Note that we are writing a new file
with open('test_csvfields.csv', 'w', newline='') as f:  # newline='' prevents blank rows on Windows
  my_writer = csv.DictWriter(f, fieldnames = ("A", "B"))
  my_writer.writeheader()
  for i in range(1, 100):
    my_writer.writerow({"B":i, "A":i-1})

5. Read the new file

In [None]:
b = 0
with open('test_csvfields.csv', 'r') as f:
  my_reader = csv.DictReader(f)
  for row in my_reader:
      if b<5:
          print(row)
          b +=1

{'A': '0', 'B': '1'}
{'A': '1', 'B': '2'}
{'A': '2', 'B': '3'}
{'A': '3', 'B': '4'}
{'A': '4', 'B': '5'}
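
Notice that every value comes back as a string (`'0'`, `'1'`, ...), because the `csv` module does not guess types. A minimal sketch of converting the columns while reading, using an in-memory file with the same two columns:

```python
import csv
import io

# In-memory stand-in for the first rows of test_csvfields.csv
csv_text = "A,B\n0,1\n1,2\n2,3\n"

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Convert each field explicitly; csv always yields strings
    rows.append({"A": int(row["A"]), "B": int(row["B"])})

total_b = sum(r["B"] for r in rows)
print(rows, total_b)
```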


##### Some Tips

- Tip 1: We may find it useful to save webpages locally (as `.html` files) before collecting data from them

In [None]:
import os

In [None]:
import requests, time, random, os

def download_page(address, filename, wait=5):
    time.sleep(random.uniform(0, wait))
    headers = {"User-Agent": "Mozilla/5.0 Chrome/124 Safari/537.36"}
    r = requests.get(address, headers=headers)
    r.raise_for_status()  # throw if error (e.g., 503)
    
    if not os.path.exists(filename):
        with open(filename, "wb") as f:
            f.write(r.content)
        print(f"Saved -> {filename}")
    else:
        print("Can't overwrite file " + filename)

download_page("https://polisci.wustl.edu/people/", "polisci_ppl.html")


HTTPError: 503 Server Error: Service Unavailable for url: https://polisci.wustl.edu/people/

In [None]:
import os, random, time
import urllib.request

def download_page(address, filename, wait = 5):
  time.sleep(random.uniform(0,wait)) # be polite: pause before each request
  
  page = urllib.request.urlopen(address)
  page_content = page.read() # raw bytes of the response
  if not os.path.exists(filename):
    with open(filename, 'wb') as p_html: # write the bytes as-is; str(bytes) would save the b'...' repr
      p_html.write(page_content)
  else:
    print("Can't overwrite file " + filename)

download_page('https://polisci.wustl.edu/people/', "polisci_ppl.html")

HTTPError: HTTP Error 503: Service Unavailable

In [58]:
import requests, time, random, os

def download_page(address, filename, wait=5):
    time.sleep(random.uniform(0, wait))
    headers = {"User-Agent": "Mozilla/5.0 Chrome/124 Safari/537.36"}
    try:
        r = requests.get(address, headers=headers)
        r.raise_for_status()  # throw if error (e.g., 503)
        if not os.path.exists(filename):
            with open(filename, "wb") as f:
                f.write(r.content)
            print(f"Saved -> {filename}")
        else:
            print("Can't overwrite file " + filename)
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e}")
    except Exception as e:
        print(f"Other error: {e}")

download_page("https://polisci.wustl.edu/people/", "polisci_ppl.html")

HTTP error: 503 Server Error: Service Unavailable for url: https://polisci.wustl.edu/people/


In [63]:
! pip install playwright
! playwright install chromium


In [64]:
import asyncio, os
from playwright.async_api import async_playwright

async def save_page(url, filename, wait_state="networkidle", timeout_ms=60000):
    os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=("Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/537.36 "
                        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"),
            java_script_enabled=True,
        )
        page = await context.new_page()
        await page.route("**/*", lambda route: route.continue_())  # keep default
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
        # Give the page time to finish loading assets / client-side rendering
        await page.wait_for_load_state(wait_state, timeout=timeout_ms)
        html = await page.content()
        with open(filename, "w", encoding="utf-8") as f:
            f.write(html)
        print(f"Saved -> {filename}")
        await browser.close()

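# In a standalone script, run the coroutine with:
#     asyncio.run(save_page("https://polisci.wustl.edu/people/", "polisci_ppl.html"))
# Jupyter already runs an event loop, so asyncio.run() raises
# "RuntimeError: asyncio.run() cannot be called from a running event loop".
# In a notebook cell, await the coroutine directly instead:
await save_page("https://polisci.wustl.edu/people/", "polisci_ppl.html")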

Then we can parse a page that is already saved on our computer, even without internet access. 

In [None]:
with open('polisci_ppl.html', encoding='utf-8') as f:
  myfile = f.read()
  soup = BeautifulSoup(myfile, 'html.parser') # name the parser explicitly to avoid a warning
# soup.prettify()

- Tip 2: You may also write rows to a `csv` file as you scrape them. This is good practice because a crash 10 hours into the process then does not erase the data you have already collected. 
- Tip 3: Use Exception Handling techniques that we covered in Day03

In [None]:
with open('iceland_test.csv', 'w', newline='') as f: # set up with the writer
  w = csv.DictWriter(f, fieldnames = ("name", "party", "phone")) # define column names
  w.writeheader() # write the header
  web_address='https://www.althingi.is/altext/cv/en/' # the web address
  web_page = urllib.request.urlopen(web_address) # open the web page
  soup = BeautifulSoup(web_page.read(), 'html.parser') # soup the web page
  all_members = soup.find_all('tr') # find the list of names and parties
  for i in range(1,3): # for members 1 and 2 (member 0 is just the table heading)
    # wrap each member in try/except so a weird item doesn't break your whole scraper
    member = {} # empty dictionary to fill in
    try:
      member_i = all_members[i].find_all('td') # subset lower to each individual item
      member["name"] = member_i[0].text # member's name
      member['party'] =  member_i[1].text # member's party
      inner_page_url = web_address + member_i[0].a['href'] # get the extension to their personal page
      inner_page = urllib.request.urlopen(inner_page_url) # open the personal page
      inner_soup = BeautifulSoup(inner_page.read(), 'html.parser') # soup the personal page
      member['phone'] = inner_soup.find('a', {'class' : 'tel'}).text # get phone number
    except Exception: # catch scraping errors, but still let Ctrl-C interrupt the run
      member['name'] = 'NA'
      member['party'] = 'NA'
      member['phone'] = 'NA'
    w.writerow(member) # write the row for this specific member
    time.sleep(random.uniform(1, 5)) # be polite, sleep!
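
Writing row by row also makes it possible to resume an interrupted run. The sketch below is one possible approach, not part of the original scraper: the helpers `already_scraped` and `append_row`, the file name `members.csv`, and the member records are all hypothetical placeholders, and no web requests are made.

```python
import csv
import os
import tempfile

FIELDS = ("name", "party", "phone")

def already_scraped(csv_path):
    """Return the set of names that are already saved in the CSV file."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as f:
        return {row["name"] for row in csv.DictReader(f)}

def append_row(csv_path, row):
    """Append one row, writing the header first if the file is new."""
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        w = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            w.writeheader()
        w.writerow(row)

csv_path = os.path.join(tempfile.mkdtemp(), "members.csv")
members = [  # placeholder records standing in for scraped data
    {"name": "Member A", "party": "Party X", "phone": "NA"},
    {"name": "Member B", "party": "Party Y", "phone": "NA"},
]

append_row(csv_path, members[0])  # pretend an earlier run stopped after one member
done = already_scraped(csv_path)

for m in members:
    if m["name"] not in done:     # skip what was saved before the interruption
        append_row(csv_path, m)

final_names = sorted(already_scraped(csv_path))
print(final_names)
```

Opening the file in append mode for each row costs a little speed, but every row is on disk before the next request starts.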

In [None]:
# Copyright of the original version:

# Copyright (c) 2014 Matt Dickenson
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.