# Introduction to Web Scraping with Python
This guide is a brief introduction to web scraping in Python, with a focus on extracting data from HTML tables.
## What is Web Scraping?
Web scraping is the collection of data from websites, usually in an automated way. In this tutorial, we will use the `requests`, `BeautifulSoup`, and `pandas` Python libraries.
## Scraping a Website for General Data
We will use the `requests` library to get the HTML source from the website and `BeautifulSoup` to parse it for the information we want.
### Understand Your Website
First, visit the website and find the data you wish to extract. Look at the form of the URL. Do you need data from multiple pages? How can you directly modify the URL to do that? In this example, we'll extract the list of core faculty members from the UCF Physics Department.
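
If the data is spread over several pages, the page number often appears as a query parameter in the URL. The sketch below builds one URL per page with the standard library. The base URL and the `page` parameter name are hypothetical; check how the site you are scraping actually numbers its pages.

```python
from urllib.parse import urlencode

# Hypothetical listing URL; replace it with the site you are scraping.
BASE_URL = "https://example.com/people/"


def build_page_urls(base_url, first_page, last_page, param="page"):
    """Return one URL per page number, including both end pages.

    The query parameter name ("page") is an assumption for illustration.
    """
    return [
        f"{base_url}?{urlencode({param: number})}"
        for number in range(first_page, last_page + 1)
    ]


urls = build_page_urls(BASE_URL, 1, 3)
for url in urls:
    print(url)
```

Each URL in the list can then be requested in turn, in the same way as a single page.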

You'll want to use the developer tools of your browser to look at the HTML of the website. What you see there tells us how to parse the HTML we get back from the `requests` library.

In [2]:
import requests

url = 'https://sciences.ucf.edu/physics/people/'  # give the URL
page = requests.get(url)  # get the html of the website
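
The call above assumes the request succeeds. As a minimal sketch, the request can be wrapped in a small helper that sets a timeout and raises an error on a bad HTTP status. The helper name `fetch_html` and the 10-second default are choices made for this illustration, not part of `requests`.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    fetch_html is a hypothetical helper; the timeout default is arbitrary.
    """
    response = requests.get(url, timeout=timeout)  # give up if the server is too slow
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.text
```

Calling `fetch_html(url)` with the URL defined above would return the same text as `page.text`, but fails loudly if the server answers with an error page.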

In [3]:
print(page.text)  # look at the html we got

<!DOCTYPE html>
<html lang="en-us">
	<head>
		<title>People | Physics</title>
<meta name='robots' content='max-image-preview:large' />
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel='dns-prefetch' href='//code.jquery.com' />
<link rel='dns-prefetch' href='//universityheader.ucf.edu' />
<link rel='dns-prefetch' href='//cdnjs.cloudflare.com' />
<link rel='dns-prefetch' href='//use.fontawesome.com' />
<link rel='stylesheet' id='ai1ec_style-css'  href='//sciences.ucf.edu/physics/wp-content/plugins/all-in-one-event-calendar/public/themes-ai1ec/vortex/css/ai1ec_parsed_css.css?ver=2.6.8' type='text/css' media='all' />
<link rel='stylesheet' id='eirudo-yt-responsive-css'  href='https://sciences.ucf.edu/physics/wp-content/plugins/simple-youtube-responsive/css/youtube-responsive.css' type='text/css' media='all' />
<link rel='stylesheet' id='ucf_events_css-css'  href='htt

Now, we'll need to parse this HTML with `BeautifulSoup`.

In [4]:
from bs4 import BeautifulSoup as bs 

soup = bs(page.content, 'html.parser')  # create a beautifulsoup object so we can parse the data

Let's look at the developer console in our browser to see how the HTML of the page is structured. The core faculty are listed under a `section` with `id="core"`. Notice also that each faculty name sits in a level-three heading `<h3>` with `class="mt-2 mb-1 person-name"`. We'll tell BeautifulSoup to grab those elements.

In [20]:
core_section = soup.find('section', id='core')  # get the core faculty section

names_collection = core_section.find_all('h3', class_='mt-2 mb-1 person-name')  # get only the <h3> (level three headings) that have the class which corresponds to a faculty entry

for name in names_collection:  # print the collection of name <h3>'s we have
    print(name.prettify())



<h3 class="mt-2 mb-1 person-name">
 Ahlam Al-Rawi
</h3>

<h3 class="mt-2 mb-1 person-name">
 Luca Argenti
</h3>

<h3 class="mt-2 mb-1 person-name">
 Christopher Bennett
</h3>

<h3 class="mt-2 mb-1 person-name">
 Aniket Bhattacharya
</h3>

<h3 class="mt-2 mb-1 person-name">
 Daniel Britt
</h3>

<h3 class="mt-2 mb-1 person-name">
 Thomas Brueckner
</h3>

<h3 class="mt-2 mb-1 person-name">
 Humberto Campins
</h3>

<h3 class="mt-2 mb-1 person-name">
 Debashis Chanda
</h3>

<h3 class="mt-2 mb-1 person-name">
 Zenghu Chang
</h3>

<h3 class="mt-2 mb-1 person-name">
 Bo Chen
</h3>

<h3 class="mt-2 mb-1 person-name">
 Zhongzhou Chen
</h3>

<h3 class="mt-2 mb-1 person-name">
 Leonid Chernyak
</h3>

<h3 class="mt-2 mb-1 person-name">
 Jacquelyn Chini
</h3>

<h3 class="mt-2 mb-1 person-name">
 Michael Chini
</h3>

<h3 class="mt-2 mb-1 person-name">
 Lee Chow
</h3>

<h3 class="mt-2 mb-1 person-name">
 Joshua Colwell
</h3>

<h3 class="mt-2 mb-1 person-name">
 James Cooney
</h3>

<h3 class="mt-2 mb-1

Now that we have this collection from the website, we need to extract the text from within the `<h3>`.

In [22]:
names = [name.text for name in names_collection]  # this is a list comprehension. If it looks scary just use a for loop
print(names)  # print the names!

['Ahlam Al-Rawi', 'Luca Argenti', 'Christopher Bennett', 'Aniket Bhattacharya', 'Daniel Britt', 'Thomas Brueckner', 'Humberto Campins', 'Debashis Chanda', 'Zenghu Chang', 'Bo Chen', 'Zhongzhou Chen', 'Leonid Chernyak', 'Jacquelyn Chini', 'Michael Chini', 'Lee Chow', 'Joshua Colwell', 'James Cooney', 'Enrique Del Barco', 'Kerri Donaldson Hanna', 'Joseph Donoghue', 'Adrienne Dove', 'Archana Dubey', 'Costas Efthimiou', 'Li Fang', 'Xiaofeng Feng', 'Yan Fernandez', 'Elena Flitsiyan', 'Joseph Harrington', 'Masahiro Ishigami', 'Richard Jerousek', 'Michael D Johnson', 'William Kaden', 'Ellen Hyeran Kang', 'Abdelkader Kara', 'Theodora Karalidi', 'Saiful Khondaker', 'Richard Klemm', 'Slava Kokoouline', 'Adam LaMee', 'Michael Leuenberger', 'Arkadiy Lyakh', 'Eduardo Mucciolo', 'Yasuyuki Nakajima', 'Madhab Neupane', 'Robert Peale', 'Talat S Rahman', 'Patrick Schelling', 'Alfons Schulte', 'Sergey Stolbov', 'Suren Tatulian', 'Laurene Tetard', 'Mihai Vaida', 'Christos Velissaris']


With that, we've **successfully** gotten the names of the core Physics faculty.
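
Passing the full class string to `find_all` matches the `class` attribute exactly as written, so it can miss elements if the site reorders or changes the helper classes. One alternative, sketched here, is a CSS selector that only requires the `person-name` class. The HTML is a made-up sample that imitates the structure described above, and the names in it are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the page structure; not the real page content.
sample_html = """
<section id="core">
  <h3 class="mt-2 mb-1 person-name"> Ada Example </h3>
  <h3 class="person-name mt-2"> Bob Placeholder </h3>
  <h3 class="section-title">Not a person</h3>
</section>
<section id="other">
  <h3 class="mt-2 mb-1 person-name">Someone Else</h3>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "section#core h3.person-name" means: any <h3> carrying the person-name
# class, anywhere inside the <section> whose id is "core".
headings = soup.select("section#core h3.person-name")
names = [h3.get_text(strip=True) for h3 in headings]  # strip=True trims whitespace
print(names)
```

Both headings inside the `core` section are picked up even though their class lists differ, while the heading in the other section is ignored.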

## Scraping Tables
Most likely, you'll want to get data from HTML tables. You can do this with the methods shown above, but it's much easier to use `pandas`.

Let's grab the data from the Wikipedia page on top international men's football goal scorers.

In [35]:
import pandas as pd 

results = pd.read_html('https://en.wikipedia.org/wiki/List_of_top_international_men%27s_football_goal_scorers_by_country')  # this returns ALL tables on the page
df = results[0]  # we just want the first table
df.head()

Unnamed: 0,Rank,Player,Country,International goals,Caps,Goals per match,First cap,Last cap,Ref.
0,1,Ali Daei,Iran,109,149,0.73,6 June 1993,21 June 2006,[1]
1,2,Cristiano Ronaldo,Portugal,103,173,0.6,20 August 2003,30 March 2021,[2]
2,3,Mokhtar Dahari,Malaysia,89,142,0.62,5 June 1972,19 May 1985,[3]
3,4,Ferenc Puskás,Hungary,84,85,0.99,20 August 1945,14 October 1956,[4]
4,5,Godfrey Chitalu,Zambia,79,111,0.71,29 June 1968,12 December 1980,[5]
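
Because `read_html` returns every table on the page, picking `results[0]` depends on the order of tables staying the same. A small sketch of a more targeted approach uses the `match` argument, which keeps only tables containing the given text. The HTML here is an invented two-table sample so the example runs offline; like the call above, it needs an HTML parser such as `lxml` installed.

```python
from io import StringIO

import pandas as pd

# Invented sample with two tables; only the second mentions "Goals".
sample_html = """
<table>
  <tr><th>Menu</th></tr>
  <tr><td>Home</td></tr>
</table>
<table>
  <tr><th>Player</th><th>Goals</th></tr>
  <tr><td>Player A</td><td>10</td></tr>
  <tr><td>Player B</td><td>7</td></tr>
</table>
"""

# match keeps only the tables whose text contains "Goals"
tables = pd.read_html(StringIO(sample_html), match="Goals")
goals_df = tables[0]
print(goals_df)
```

The same `match` argument can be passed when reading directly from a URL.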


In [36]:
# Let's clean up the table
df.drop(columns='Ref.', inplace=True)  # drop the Ref column
df.head(20)

Unnamed: 0,Rank,Player,Country,International goals,Caps,Goals per match,First cap,Last cap
0,1,Ali Daei,Iran,109,149,0.73,6 June 1993,21 June 2006
1,2,Cristiano Ronaldo,Portugal,103,173,0.6,20 August 2003,30 March 2021
2,3,Mokhtar Dahari,Malaysia,89,142,0.62,5 June 1972,19 May 1985
3,4,Ferenc Puskás,Hungary,84,85,0.99,20 August 1945,14 October 1956
4,5,Godfrey Chitalu,Zambia,79,111,0.71,29 June 1968,12 December 1980
5,6,Hussein Saeed,Iraq,78,137,0.57,5 September 1976,3 March 1990
6,7,Pelé,Brazil,77,92,0.84,7 July 1957,18 July 1971
7,8,Kunishige Kamamoto,Japan,75,76,0.99,3 March 1964,15 June 1977
8,8,Bashar Abdullah[a],Kuwait,75,134,0.56,16 March 1996,26 May 2018
9,10,Sunil Chhetri,India,72,115,0.63,12 June 2005,19 November 2019
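
Some player names still carry Wikipedia footnote markers, such as `Bashar Abdullah[a]`. One possible cleanup step, shown on a small hand-made frame instead of the full table, removes any bracketed marker with a regular expression.

```python
import pandas as pd

# A few names copied from the table above, in a small stand-alone frame.
players = pd.DataFrame({"Player": ["Bashar Abdullah[a]", "George Weah[d]", "Pelé"]})

# Remove anything of the form [ ... ] and trim leftover whitespace.
players["Player"] = (
    players["Player"]
    .str.replace(r"\[[^\]]*\]", "", regex=True)
    .str.strip()
)
print(players["Player"].tolist())
```

The same two string operations could be applied to `df['Player']` in the full table.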


Now, do whatever you wish with the table. For example, we could convert the "First cap" and "Last cap" columns to `datetime` and find the difference.

In [47]:
# First convert to datetime type so Python can handle the dates
df['First cap'] = pd.to_datetime(df['First cap']) 
df['Last cap'] = pd.to_datetime(df['Last cap'])

df['Diff cap'] = df['Last cap'] - df['First cap']  # subtracting datetimes gives a Timedelta column, displayed in days
df.head()


Unnamed: 0,Rank,Player,Country,International goals,Caps,Goals per match,First cap,Last cap,Diff cap
0,1,Ali Daei,Iran,109,149,0.73,1993-06-06,2006-06-21,4763 days
1,2,Cristiano Ronaldo,Portugal,103,173,0.6,2003-08-20,2021-03-30,6432 days
2,3,Mokhtar Dahari,Malaysia,89,142,0.62,1972-06-05,1985-05-19,4731 days
3,4,Ferenc Puskás,Hungary,84,85,0.99,1945-08-20,1956-10-14,4073 days
4,5,Godfrey Chitalu,Zambia,79,111,0.71,1968-06-29,1980-12-12,4549 days
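
A difference in days is hard to read for long careers. As a rough sketch, the `Timedelta` column can be turned into whole days and then into approximate years. The two rows below are copied from the table above; the column names `Career days` and `Career years` and the 365.25-day year are choices made for this illustration.

```python
import pandas as pd

# Two rows copied from the table above.
caps = pd.DataFrame({
    "Player": ["Ali Daei", "Ferenc Puskás"],
    "First cap": ["6 June 1993", "20 August 1945"],
    "Last cap": ["21 June 2006", "14 October 1956"],
})

# An explicit format avoids any ambiguity when parsing dates.
for column in ("First cap", "Last cap"):
    caps[column] = pd.to_datetime(caps[column], format="%d %B %Y")

caps["Diff cap"] = caps["Last cap"] - caps["First cap"]

# .dt.days extracts the whole number of days from each Timedelta.
caps["Career days"] = caps["Diff cap"].dt.days
# 365.25 is an approximation that averages over leap years.
caps["Career years"] = (caps["Career days"] / 365.25).round(1)
print(caps[["Player", "Career days", "Career years"]])
```

The day counts agree with the `Diff cap` values shown above (4763 and 4073 days).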


Then, sort by the number of days between first and last cap, longest first.

In [51]:
df.sort_values(by=['Diff cap'], ascending=False)

Unnamed: 0,Rank,Player,Country,International goals,Caps,Goals per match,First cap,Last cap,Diff cap
157,146,George Weah[d],Liberia,18,75,0.24,1986-02-23,2018-09-11,11888 days
190,179,Ildefons Lima,Andorra,11,128,0.09,1997-06-22,2019-11-17,8183 days
8,8,Bashar Abdullah[a],Kuwait,75,134,0.56,1996-03-16,2018-05-26,8106 days
166,157,Mario Frick,Liechtenstein,16,125,0.13,1993-10-26,2015-10-12,8021 days
100,93,Jari Litmanen,Finland,32,137,0.23,1989-10-22,2010-11-17,7696 days
61,57,Michael Mifsud,Malta,42,143,0.29,2000-02-10,2020-11-11,7580 days
19,17,Hossam Hassan,Egypt,68,176,0.39,1985-09-10,2006-02-07,7455 days
23,24,Zlatan Ibrahimović,Sweden,62,118,0.53,2001-01-31,2021-03-28,7361 days
125,118,Eiður Guðjohnsen,Iceland,26,88,0.30,1996-04-24,2016-06-06,7348 days
74,68,Goran Pandev,North Macedonia,37,117,0.32,2001-06-06,2021-03-31,7238 days
