# Web Scraping Staff Details from a High School Website




In this project, I explore using the BeautifulSoup library to scrape a high school website. I plan to pull in staff information: each person's name, role at the school, and phone number.

Before we begin, we need to import BeautifulSoup and the requests library. Once that is done, we specify the URL where the high school's staff information resides. In this case we are focusing on Jacksonville High School, and the staff page in particular: https://jhs.jisd.org/apps/staff/

In [1]:
# import our necessary first packages
from bs4 import BeautifulSoup
import requests

In [14]:
# set the url we want to visit
url = "https://jhs.jisd.org/apps/staff/"

# visit that url, and grab the html of said page
html = requests.get(url)
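
The request above assumes the page loads successfully. As a hedged sketch, the request can be made to fail loudly on network problems or HTTP error codes. The `fetch_html` helper name and the 10-second timeout are illustrative choices, not part of the original notebook:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML text, raising an exception on failure.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    # timeout stops the call from hanging indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://jhs.jisd.org/apps/staff/")` would then return the same text as `html.text`, or raise an exception instead of silently handing back an error page.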

Let's see what we pulled into the html variable.

In [15]:
# .text returns the response content decoded as a Unicode string
html.text[:500]

'\n\n\n\n\n\n\n\t\t\t\n\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\t\t\t\t\n\n\n\n\n\n\n\n\n\t\t\t\t\n\n\n\n\n\n    \n<!DOCTYPE html>\n<!-- Powered by Edlio -->\n    \n        <html lang="en" class="edlio desktop">\n    \n    <!-- w215 -->\n<head>\n<script>\ndataLayer = [{\n\'CustomerType\': \'DWS Child\',\n\'AccountExternalId\': \'0010b00002HJKP7AAP\',\n\'WebsiteName\': \'Jacksonville High\',\n\'WebsiteId\': \'JACKSON-HS\',\n\'DistrictExternalId\': \'0010b00002HIw95AAD\',\n\'DistrictName\': \'Jacksonville ISD\',\n\'DistrictWebsiteId\': \'JACKSON-D\'\n}];\n</script>\n<script>(function(w,d,s,l,i){w[l]='

Before we can start scraping, we need to convert this response into a soup object so we can parse it with Python and BS4.

In [16]:
# convert this into a soup object
soup = BeautifulSoup(html.text, 'html.parser')


### Retrieving data from the HTML page

Let's first find each staff member's name listed on the page we've loaded. To do this, we need to know where in the HTML the name element is housed. To find the HTML that renders the staff members' names, we can use Google Chrome's Inspect tool:

> 1. Visit the [URL](https://jhs.jisd.org/apps/staff/). 

> 2. Right-click on an element you are interested in, then choose Inspect (in Chrome). 

> 3. This will open the Developer Tools and show the HTML used to render the selected page element. 

I will be using this method to find tags associated with elements of the page we want to scrape.
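
Once Inspect shows which tag and class hold the data, that information can also be written as a CSS selector. The snippet below is a small sketch on a made-up HTML fragment (the fragment, the names in it, and the `sample_html` variable are assumptions for illustration), comparing `find_all` with the equivalent `select` call for an `a` tag with class `name`:

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; the real page has many more attributes
sample_html = """
<div class="user-info ada">
  <a class="name" href="#">Jane Doe</a>
  <a class="user-phone" href="#">555-0100</a>
</div>
<div class="user-info ada">
  <a class="name" href="#">John Roe</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Two ways to say "every <a> tag whose class is 'name'"
by_find_all = sample_soup.find_all(name="a", attrs={"class": "name"})
by_selector = sample_soup.select("a.name")

print([tag.get_text() for tag in by_find_all])
print([tag.get_text() for tag in by_selector])
```

Either form works; the selector version is handy because it can be copied almost directly from what the Inspect panel shows.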


In [40]:
# find the tags that hold the staff members' names
soup.find_all(name='a', attrs={'class':'name'})

[<a class="name" href="/apps/pages/index.jsp?uREC_ID=1258461&amp;type=u" id="staff_name_1258461_0">
 	Justin Adams
 					</a>,
 <a class="name" href="/apps/pages/index.jsp?uREC_ID=469027&amp;type=u" id="staff_name_469027_0">
 	Mark Alexander
 					</a>,
 <a class="name" href="/apps/pages/index.jsp?uREC_ID=634534&amp;type=u" id="staff_name_634534_0">
 	Tiffany Alexander
 					</a>,
 <a class="name" href="/apps/pages/index.jsp?uREC_ID=469096&amp;type=u" id="staff_name_469096_0">
 	Jennifer Anderson
 					</a>,
 <a class="name" href="/apps/pages/index.jsp?uREC_ID=469098&amp;type=u" id="staff_name_469098_0">
 	David Baez
 					</a>,
 <a class="name" href="/apps/pages/index.jsp?uREC_ID=634530&amp;type=u" id="staff_name_634530_0">
 	Kim Baker
 					</a>,
 <a class="name" href="/apps/pages/index.jsp?uREC_ID=1027922&amp;type=u" id="staff_name_1027922_0">
 	Nicole Barle
 					</a>,
 <a class="name" href="/apps/pages/index.jsp?uREC_ID=469028&amp;type=u" id="staff_name_469028_0">
 	Donnie Barrier

Now that we have found a list of tags containing the staff members' names, let's loop through them one by one to retrieve only the text that matters to us. In the following cell, we'll print out the name (and **only** the name, not the rest of the HTML) of each entry.

In [23]:
# for each element you find, print out the name
for entry in soup.find_all(name='a', attrs={'class':'name'}):
    print(entry.text)


	Justin Adams
					

	Mark Alexander
					

	Tiffany Alexander
					

	Jennifer Anderson
					

	David Baez
					

	Kim Baker
					

	Nicole Barle
					

	Donnie Barrier
					

	Jana Bateman
					

	Brittney Batten
					

	Dawn Bauer
					

	Cassie Bayless
					

	Thad Black
					

	Emily Bolton
					

	Jeffrey Boyd
					

	Abby Bradford
					

	Deena Brand
					

	Liz Brents
					

	Mary Carol Brown
					

	Jaime Browning
					

	Kandie Bruckner
					

	Kenny Canady
					

	Verna Cannon
					

	Bindi Caveness
					

	Wayne Coleman
					

	Laura Cook
					

	Randall Covey
					

	Andrew Cullen
					

	Emily Cullen
					

	Kaleb DiCiaccio
					

	James Dorman
					

	Demi Dotson Wilkins
					

	Jillian Dublin
					

	Andrea Earle
					

	Teresa Easterling
					

	Tim Eden
					

	Crystal Foreman
					

	Betsy Foster
					

	Keith Fuller
					

	Jennifer Gilbert
					

	Christopher Giorlando
					

	Jan Gowin
					

	Megan Greenville
					

	Laura Guidry
					

	Nancy Guinn
					

	Chri
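
The blank lines and tabs in the output above come from whitespace stored inside each tag. As a minimal sketch on a made-up fragment (not the live page), `get_text(strip=True)` removes that surrounding whitespace at extraction time:

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the whitespace around names on the staff page
sample_html = (
    '<a class="name" href="#">\n\tJane Doe\n\t\t\t\t\t</a>'
    '<a class="name" href="#">\n\tJohn Roe\n\t\t\t\t\t</a>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# strip=True trims leading and trailing whitespace from the tag's text
clean_names = [
    tag.get_text(strip=True)
    for tag in sample_soup.find_all(name="a", attrs={"class": "name"})
]
print(clean_names)
```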

### Use Pandas to create a DataFrame with all the elements we scrape

Like the names, we can pull in other details about the staff members one by one. But the result would be much more usable if we could put all the pieces together.

To be efficient, we want a single loop over the entries on the page. That means we need to find the element that all of the other elements (name, role, phone number) are housed within. Once we find it, we can loop through each entry and grab the relevant information from inside it.

We then produce a dataframe with the columns "name", "role", and "phone_number" that contains all the entries from the page.
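
One way to discover that shared parent element, besides reading the Inspect panel, is to start from a tag we already know and walk upward. The sketch below uses a made-up fragment whose class names are copied from the staff page (`user-info ada`, `user-position user-data`, `user-phone`); the person and phone number in it are invented:

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="user-info ada">
  <a class="name" href="#">Jane Doe</a>
  <span class="user-position user-data">Teacher</span>
  <a class="user-phone" href="#">555-0100</a>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Start from a name tag and climb to the nearest enclosing <div>
name_tag = sample_soup.find("a", {"class": "name"})
container = name_tag.find_parent("div")

# The container's class tells us what to search for in the main loop
print(container["class"])

# Every field for this staff member can now be found inside the container
print(container.find("span", {"class": "user-position user-data"}).text)
```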

In [24]:
import pandas as pd

In [27]:
# I'm going to create my empty df first
jhs_staff = pd.DataFrame(columns=["name","role","phone_number" ])

Below, you will notice I have used 'if' statements for the role and phone number. Not all staff members have a role and phone number listed, and that leads to an error: when the code reaches an entry that lacks the element I am looking for, find returns None, and reading .text from None raises an AttributeError that stops the loop.

Because of this, I first set each value to a default of 'NA' and then replace it with the actual value if it exists on the website. As I work through each entry, I also append that information to the dataframe one row at a time.

At the end, the name column still contains some extra \n and \t characters, which I remove using the .str.replace method.
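
The missing-element problem can be reproduced in isolation: `find` returns `None` when nothing matches, so the code has to check before reading the text. Below is a hedged sketch of the default-value pattern wrapped in a helper. `text_or_default` is a hypothetical name, and the HTML fragment is made up:

```python
from bs4 import BeautifulSoup


def text_or_default(parent, tag_name, css_class, default="NA"):
    """Return the stripped text of the first matching child, or a default."""
    tag = parent.find(tag_name, {"class": css_class})
    # find() returns None when nothing matches, so check before reading text
    return tag.get_text(strip=True) if tag is not None else default


# Made-up entry with a role but no phone number
sample_html = """
<div class="user-info ada">
  <a class="name" href="#">Jane Doe</a>
  <span class="user-position user-data">Teacher</span>
</div>
"""
entry = BeautifulSoup(sample_html, "html.parser").find("div", {"class": "user-info ada"})

role = text_or_default(entry, "span", "user-position user-data")
phone_number = text_or_default(entry, "a", "user-phone")
print(role, phone_number)
```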

In [34]:
for entry in soup.find_all('div', {'class':'user-info ada'}):
    # grab the name
    name = entry.find('a', {'class':'name'}).text
    
    # grab the role 
    role = "NA"
    role_tag = entry.find('span', {'class':'user-position user-data'})
    
    if role_tag:
        role = role_tag.text
    
    # grab the phone number
    phone_number = "NA"
    phone_number_tag = entry.find('a', {'class':'user-phone'})
        
    if phone_number_tag:
        phone_number = phone_number_tag.text
    
    jhs_staff.loc[len(jhs_staff)]=[name, role, phone_number ]
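
Appending with `.loc` one row at a time works, but it also means that rerunning the cell adds every row again to the existing dataframe. An alternative sketch, assuming the page structure matches the loop above, collects the rows in a plain list and builds the dataframe once. The `rows_from_soup` helper and the sample fragment are hypothetical:

```python
import pandas as pd
from bs4 import BeautifulSoup


def rows_from_soup(soup):
    """Collect one dict per staff entry (hypothetical helper)."""
    rows = []
    for entry in soup.find_all("div", {"class": "user-info ada"}):
        role_tag = entry.find("span", {"class": "user-position user-data"})
        phone_tag = entry.find("a", {"class": "user-phone"})
        rows.append({
            "name": entry.find("a", {"class": "name"}).get_text(strip=True),
            "role": role_tag.get_text(strip=True) if role_tag is not None else "NA",
            "phone_number": phone_tag.get_text(strip=True) if phone_tag is not None else "NA",
        })
    return rows


# Made-up fragment standing in for the real page
sample_soup = BeautifulSoup(
    '<div class="user-info ada"><a class="name">Jane Doe</a>'
    '<span class="user-position user-data">Teacher</span></div>'
    '<div class="user-info ada"><a class="name">John Roe</a>'
    '<a class="user-phone">555-0100</a></div>',
    "html.parser",
)

# Building the dataframe in one step starts from a clean slate on every run
staff_df = pd.DataFrame(rows_from_soup(sample_soup), columns=["name", "role", "phone_number"])
print(staff_df)
```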

In [36]:
jhs_staff['name'] = jhs_staff['name'].str.replace('\n','')
jhs_staff['name'] = jhs_staff['name'].str.replace('\t','')
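
Removing `\n` and `\t` one at a time works here. A more general sketch, if the unwanted characters only appear at the start and end of each value, is `.str.strip()`, which trims all leading and trailing whitespace in one call. The small dataframe below is made up for illustration:

```python
import pandas as pd

# Made-up values that mimic the raw scraped names
demo = pd.DataFrame({"name": ["\n\tJane Doe\n\t\t\t\t\t", "\n\tJohn Roe\n\t\t\t\t\t"]})

# .str.strip() with no arguments removes leading and trailing whitespace,
# including newlines and tabs, but leaves the space inside each name alone
demo["name"] = demo["name"].str.strip()
print(demo["name"].tolist())
```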

Voilà! My dataframe is ready, and now I can see all the information I need in a nice tabular format.

In [41]:
jhs_staff.head(20)

| | name | role | phone_number |
|---|---|---|---|
| 0 | Justin Adams | Teacher/Asst.Softball Coach | 903-586-3661 |
| 1 | Mark Alexander | Head Basketball Coach/JMS Football | 903-589-3601 |
| 2 | Justin Adams | Teacher/Asst.Softball Coach | 903-586-3661 |
| 3 | Mark Alexander | Head Basketball Coach/JMS Football | 903-589-3601 |
| 4 | Tiffany Alexander | Teacher - CTE; Charmer Asst. Director | |
| 5 | Jennifer Anderson | Chemistry Teacher/CTE Forensic Science | 903-586-3661 |
| 6 | David Baez | Foreign Language Teacher | 903-586-3661 |
| 7 | Kim Baker | Paraprofessional | |
| 8 | Nicole Barle | Pre-AP World Geography and US History | |
| 9 | Donnie Barrier | Band Director | 903-586-3661 x 7017 |
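
The first two staff members appear twice in the preview above. One plausible cause is that the loop cell ran more than once against the same dataframe; that is an assumption, not something the notebook confirms. A minimal sketch of cleaning up repeated rows with `drop_duplicates`, using a made-up dataframe:

```python
import pandas as pd

# Made-up dataframe with the first two rows repeated
demo = pd.DataFrame(
    {
        "name": ["Jane Doe", "John Roe", "Jane Doe", "John Roe", "Ann Poe"],
        "role": ["Teacher", "Coach", "Teacher", "Coach", "Librarian"],
        "phone_number": ["555-0100", "555-0101", "555-0100", "555-0101", "NA"],
    }
)

# Keep the first copy of each fully identical row, then renumber the index
deduplicated = demo.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```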


Now I want to export this dataframe to a CSV file. I will use the os library to find the path to my desktop and then use that path to export the dataframe.

In [43]:
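import os
# join the parts with os.path.join; '\D' inside a plain string is an invalid escape sequence
print(os.path.join(os.environ['USERPROFILE'], 'Desktop'))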

C:\Users\Vinisha Thakkar\Desktop


In [47]:
jhs_staff.to_csv(r'C:\Users\Vinisha Thakkar\Desktop\export_dataframe.csv', index=False, header=True)
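
The hard-coded path above only works on this particular Windows account. As a hedged sketch, `pathlib` can build the path from the current user's home directory instead. The `export_to_folder` helper name is hypothetical, and the assumption that a `Desktop` folder exists under the home directory may not hold on every machine:

```python
from pathlib import Path

import pandas as pd


def export_to_folder(df, folder, filename="export_dataframe.csv"):
    """Write df as CSV into folder and return the full path (hypothetical helper)."""
    target = Path(folder) / filename
    # index=False leaves the row numbers out of the file
    df.to_csv(target, index=False)
    return target


# Assumed location: a Desktop folder inside the user's home directory
desktop = Path.home() / "Desktop"
print(desktop)
```

With this in place, `export_to_folder(jhs_staff, desktop)` would write the same file without spelling out the user name.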