# Web scraping
## The goal
Create a DataFrame with info on ECE faculty scraped from the department homepage  
https://schulich.ucalgary.ca/departments/electrical-and-computer-engineering/faculty  

## The tools and resources
1. requests (to obtain HTML from web servers) https://realpython.com/python-requests/

2. Beautiful Soup (to parse HTML and find things) https://www.crummy.com/software/BeautifulSoup/bs4/doc/

You might also like to read Chapter 1, "Your First Web Scraper", in Web Scraping with Python, 2nd Ed., by Ryan Mitchell, available online through the UofC library. The code is on GitHub too: https://github.com/REMitchell/python-scraping

For more HTML exposure, right-click a page while browsing and choose 'View Source', and check whether your browser offers right-click -> 'Inspect Element'. Safari on Mac and Chrome both do.

For more information on HTML, W3Schools might be a good start: https://www.w3schools.com/html/default.asp

## 1. Obtain HTML from the web

In [1]:
import requests

In [2]:
python_scraping_url = 'http://pythonscraping.com/pages/page1.html'
python_scraping_url2= 'http://www.pythonscraping.com/pages/warandpeace.html'
python_scraping_url3 = 'http://pythonscraping.com/pages/page3.html'
schulich_url = "https://schulich.ucalgary.ca/departments/electrical-and-computer-engineering/faculty"




In [3]:
requests.get?

# If the built-in help is not enough, check the online docs:
# https://requests.readthedocs.io

In [7]:
response=requests.get(python_scraping_url2)
response

<Response [200]>

In [8]:
response.headers

{'Date': 'Fri, 01 Nov 2019 20:27:27 GMT', 'Server': 'Apache', 'Last-Modified': 'Sat, 09 Jun 2018 19:15:59 GMT', 'ETag': '"4121bd1-2dcb-56e3a58bcb54a"', 'Accept-Ranges': 'bytes', 'Content-Length': '11723', 'Cache-Control': 'max-age=1209600', 'Expires': 'Fri, 15 Nov 2019 20:27:27 GMT', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html'}
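
The cells above call `requests.get` without a timeout and read the status code by eye. As a minimal sketch (the helper name `fetch_html` and the 10-second default are assumptions for illustration, not part of requests), the two checks could be wrapped like this:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and the 10-second default are illustrative choices.
    """
    # Without a timeout, requests can wait indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into a requests.HTTPError exception.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html('http://www.pythonscraping.com/pages/warandpeace.html')
```

A `<Response [200]>` means the request succeeded; `raise_for_status()` makes any 4xx or 5xx response fail loudly instead of handing an error page to the parser.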

In [9]:
print(response.text)

<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p/>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the first t

## 2. Parse HTML with Beautiful Soup

In [10]:
from bs4 import BeautifulSoup

In [11]:
response=requests.get(python_scraping_url2)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

<html>
 <head>
  <style>
   .green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
  </style>
 </head>
 <body>
  <h1>
   War and Peace
  </h1>
  <h2>
   Chapter 1
  </h2>
  <div id="text">
   "
   <span class="red">
    Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
   </span>
   "
   <p>
   </p>
   It was in July, 1805, and the speaker was the well-known
   <span class="green">
    Anna
Pavlovna Scherer
   </span>
   , maid of honor and favorite of the
   <span class="green">
    Empress Marya
Fedorovna
   </span>
   . With these words she greeted
   <span 
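
The second argument to `BeautifulSoup` names the parser. `'lxml'` is a separate package; if it is not installed, Beautiful Soup raises `bs4.FeatureNotFound`. A small sketch with Python's built-in `'html.parser'`, on an inline string standing in for a downloaded page (the string is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# A tiny inline document standing in for a downloaded page.
html = '<html><body><h1>War and Peace</h1><h2>Chapter 1</h2></body></html>'

# 'html.parser' ships with Python; 'lxml' must be installed separately.
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)
print(soup.h2.text)
```

The parsers can disagree on malformed HTML, so it is worth sticking to one parser throughout a project.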

## 3. Find stuff with Beautiful Soup
**Top-down:** Follow Web Scraping with Python, Ch. 1 and Ch. 2  
**Bottom-up:** Follow the Beautiful Soup docs

- access elements directly
- stripped_strings (a generator attribute, used without parentheses)
- find_all()
- find()
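
The four access styles in the list can be compared side by side on a small inline snippet. This is only a sketch: the HTML string is made up for illustration and `'html.parser'` is used so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

html = """
<div id="text">
  <span class="red">Well, Prince</span>
  <span class="green">Anna Pavlovna</span>
  <span class="green">Prince Vasili</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

first_span = soup.span                             # direct access: first <span>
first_green = soup.find('span', class_='green')    # first match, or None
all_green = soup.find_all('span', class_='green')  # list of every match
missing = soup.find('span', class_='blue')         # nothing matches, so None
strings = list(soup.stripped_strings)              # text pieces, whitespace trimmed

print(first_span.text)
print(first_green.text)
print(len(all_green))
print(missing)
print(strings)
```

Because `find()` returns `None` when nothing matches, chaining `.text` onto a failed `find()` raises `AttributeError`; `find_all()` returns an empty list instead.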

In [16]:
soup.span

<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In [17]:
soup.span.text

"Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news."

In [18]:
soup.find_all('span', class_='green')

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri

In [19]:
for green in soup.find_all('span', class_='green'):
    print(green.text)

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
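
Some names in the output above are split across two lines because the source HTML contains a newline inside the `<span>`. One possible cleanup, sketched on a made-up inline snippet, is to collapse every run of whitespace to a single space before collecting or counting the names:

```python
from collections import Counter

from bs4 import BeautifulSoup

# Made-up snippet with a newline inside the first <span>.
html = """
<div id="text">
<span class="green">Anna
Pavlovna Scherer</span> greeted <span class="green">Prince Vasili</span>.
<span class="green">Anna Pavlovna</span> and <span class="green">Prince Vasili</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

names = []
for green in soup.find_all('span', class_='green'):
    # str.split() with no arguments splits on any whitespace run,
    # so re-joining with one space removes the embedded newline.
    names.append(' '.join(green.text.split()))

counts = Counter(names)
print(names)
print(counts.most_common(1))
```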


In [20]:
for string in soup.stripped_strings:
    print(string)

.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
War and Peace
Chapter 1
"
Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
"
It was in July, 1805, and the speaker was the well-known
Anna
Pavlovna Scherer
, maid of honor and favorite of the
Empress Marya
Fedorovna
. With these words she greeted
Prince Vasili Kuragin
, a man
of high rank and importance, who was the first to arrive at her
reception.
Anna Pavlovna
had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg
, used only by the 

In [21]:
for string in soup.body.stripped_strings:
    print(string)

War and Peace
Chapter 1
"
Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
"
It was in July, 1805, and the speaker was the well-known
Anna
Pavlovna Scherer
, maid of honor and favorite of the
Empress Marya
Fedorovna
. With these words she greeted
Prince Vasili Kuragin
, a man
of high rank and importance, who was the first to arrive at her
reception.
Anna Pavlovna
had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg
, used only by the elite.
All her invitations without exception, written in French, and
de

In [22]:
response = requests.get('http://pythonscraping.com/pages/page3.html')
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

<html>
 <head>
  <style>
   img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
  </style>
 </head>
 <body>
  <div id="wrapper">
   <img src="../img/gifts/logo.jpg" style="float:left;"/>
   <h1>
    Totally Normal Gifts
   </h1>
   <div id="content">
    Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
    <p>
     We haven't figured out how to make online shopping carts yet, but you can send us a check to:
     <br/>
     123 Main St.
     <br/>
     Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.
    </p>
   </div>
   <table id="giftList">
    <tr>
     <th>
      Item Title
     </th>
     <th>
      Description
     </th>
     <th>
      Cost
     </th>
     <th>
      Image
     </th>
   

In [24]:
for item in soup.find_all('tr', class_='gift'):
    print(item.td.text)
    # .children is an iterator, not a list, so indexing it raises the TypeError below
    print(item.children[1])


Vegetable Basket



TypeError: 'list_iterator' object is not subscriptable
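
The `TypeError` arises because `.children` is an iterator and cannot be indexed. A sketch of two ways around it, on hypothetical rows that only mimic the table layout (the cell text other than the first item title is placeholder data):

```python
from bs4 import BeautifulSoup

# Hypothetical rows mimicking the gift table; descriptions are placeholders.
html = """
<table id="giftList">
  <tr class="gift"><td>Vegetable Basket</td><td>Placeholder description A</td><td>$1.00</td></tr>
  <tr class="gift"><td>Another Item</td><td>Placeholder description B</td><td>$2.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all('tr', class_='gift'):
    cells = item.find_all('td')  # a list, so indexing works
    print(cells[0].text, '|', cells[1].text)

# Materialising the iterator also works, but .children includes any
# whitespace text nodes between tags, so the indexes can shift.
first_row = soup.find('tr', class_='gift')
second_child = list(first_row.children)[1]
print(second_child.text)
```

`find_all('td')` is usually the more robust choice, since it ignores the whitespace-only strings that can sit between cells in real pages.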

### Using regular expressions

In [25]:
import re

In [29]:
for image in soup.find('table', id='giftList').find_all(src=re.compile(r'^\.\.')):
    print(image["src"])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
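
A compiled pattern can filter on any attribute, not only `src`. As a sketch under assumed markup (one `<p>` per person with a `mailto:` link; the names, addresses, and paths below are invented), the same idea can pull out email links and collect one record per person:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: one <p> per person with an email and a profile link.
html = """
<div class="people">
  <p>Dr. A. Example, Professor<br/>
     <a href="mailto:a.example@example.org">a.example@example.org</a><br/>
     <a href="/contacts/a-example">View profile</a></p>
  <p>Dr. B. Sample, Instructor<br/>
     <a href="mailto:b.sample@example.org">b.sample@example.org</a><br/>
     <a href="/contacts/b-sample">View profile</a></p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

mailto = re.compile(r'^mailto:')
records = []
for p in soup.find_all('p'):
    link = p.find('a', href=mailto)  # first <a> whose href starts with mailto:
    if link is None:
        continue  # skip paragraphs without an email link
    first_line = next(p.stripped_strings)  # e.g. 'Dr. A. Example, Professor'
    name, _, title = first_line.partition(', ')
    records.append({
        'name': name,
        'title': title,
        'email': link['href'][len('mailto:'):],
    })

print(records)
```

A list of dictionaries like `records` can be passed straight to `pandas.DataFrame(records)` to get one row per person. Real pages may split the name and title across several strings, so the `partition` step would need adjusting there.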


## 4. Schulich site

In [33]:
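# get webpage
# note: verify=False disables TLS certificate verification and triggers an InsecureRequestWarning
response = requests.get('https://schulich.ucalgary.ca/electrical-computer/faculty-members', verify=False)
# parse with Beautiful Soup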
soup = BeautifulSoup(response.text, 'lxml')
# print(soup.body.prettify())

# find things
#    - get each row in the correct div
for profs in soup.find('div', class_='col-sm-12 two-col').find_all('p'):
    for info in profs.stripped_strings:
        print(info)
    
    print('------')


#    - get stripped_strings to build name and title
#    - use find_all href with regex mailto
#    - use find_all href with regex /contents



Dr. Norm Bartley, Senior Instructor
nbartley@ucalgary.ca
View profile >
------
Dr. Laleh Behjat, Professor
laleh@ucalgary.ca
View profile >
------
Dr. Leo Belostotski, Professor
lbelosto@ucalgary.ca
View profile >
------
Dr.
Laura Curiel
, Assistant Professor
laura.curiel@ucalgary.ca
View profile >
------
Dr.
Colin Dalton
, Assistant Professor
cdalton@ucalgary.ca
View profile >
------
Dr. Vassil Dimitrov, Associate Professor
vdimitro@ucalgary.ca
View profile >
------
Dr. Abraham Fapojuwo, Professor, Associate Head (Graduate Studies)
fapojuwo@ucalgary.ca
View profile >
------
Dr. Behrouz Far, Professor
far@ucalgary.ca
View profile >
------
Dr. Elise Fear, Professor
fear@ucalgary.ca
View profile >
------
Dr. Fadhel Ghannouchi, Professor
fghannou@ucalgary.ca
View profile >
------
Dr. Anis Haque,
Teaching Professor
anis@ucalgary.ca
View profile >
------
Dr. Mohamed Helaoui, Assistant Professor
mhelaoui@ucalgary.ca
View profile >
------
Dr. Hadi Hemmati, Assistant Professor
hadi.hemmati@uca