## UT Journalism Professor Bio Scraper

Data Scraper: José Martínez, martinez307jose@gmail.com

In this project, I wanted to find out whether any professors in my journalism school had done investigative work during their careers.

The faculty page, https://journalism.utexas.edu/faculty, lists every professor in the journalism school at UT, but shows only each one's name, position, and image. To see more about a professor's career, you have to click on the image, which opens an entirely different page.

Clicking through every page to find a professor who had done investigative work seemed too tedious, so I decided to create this scraper.

It's worth noting that the page already has a search bar that does exactly what I want, but I wanted to build it myself for fun.

In essence, I used Selenium to first pull every link from the faculty page. I noticed that every professor's bio page had "faculty/" in its URL, so I filtered for that. Then, I pulled the page source of each bio page. Searching for a string in the page source is roughly equivalent to searching for a keyword on the page itself. The final cell therefore visits every bio link, checks whether the page source contains your keyword, and collects the last names of the professors whose pages contain it.
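One caveat with searching the raw page source is that a keyword can match markup (a class name, a script, a URL) rather than the bio text itself. Below is a minimal sketch, using only the standard library, of reducing HTML to its visible text before searching. `visible_text` and the sample HTML are hypothetical and are not part of the notebook.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text nodes of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up HTML for illustration only.
sample = '<div class="investigative-widget"><p>Teaches audio reporting.</p></div>'
print("investigative" in sample)                # True: matched a class name
print("investigative" in visible_text(sample))  # False: not in the bio text
```

In the scraper, this would mean searching `visible_text(driver.page_source)` instead of `driver.page_source`, at the cost of an extra parsing step per page.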

For example, to find out whether any of our professors has worked at ‘CNN’ or has done ‘audio’ work, you would input that keyword and the scraper would return which professors have it in their bio.
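The keyword lookup can be sketched without a browser. The example below assumes the bio text has already been collected into a dictionary keyed by last name; `professors_matching` and the sample bios are hypothetical. Unlike the notebook cell, which matches case-sensitively, this sketch ignores case so that ‘cnn’ and ‘CNN’ behave the same.

```python
def professors_matching(keyword, bios):
    """Return last names whose bio text contains keyword, ignoring case.

    `bios` maps last name -> bio text (a hypothetical structure for illustration).
    """
    needle = keyword.casefold()
    return [name for name, text in bios.items() if needle in text.casefold()]


# Made-up sample data, not real faculty bios.
sample_bios = {
    "Doe": "Former producer at CNN; teaches audio storytelling.",
    "Roe": "Covers data visualization and design.",
}
print(professors_matching("cnn", sample_bios))    # ['Doe']
print(professors_matching("Audio", sample_bios))  # ['Doe']
```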

In [171]:
import pandas as pd

In [172]:
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException

In [180]:
driver = webdriver.Chrome('/Users/josemartinez/Desktop/chromedriver')  

In [134]:
# Pull all links and put them into a list
driver.get("https://journalism.utexas.edu/faculty")
pages = []
# Collect every anchor (<a>) element on the page
urls = driver.find_elements(By.TAG_NAME, 'a')
for url in urls:
    # href is None for anchors without a link target
    pages.append(url.get_attribute('href'))

In [140]:
# Drop entries that are None (anchors with no href)
clean_pages = [x for x in pages if x is not None]

In [141]:
# Filter for links that have professor bios
faculty_pages = []
for x in clean_pages:
    if 'faculty/' in x:
        faculty_pages.append(x)
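If the listing page links to the same bio more than once (for instance from both the image and the name, which is an assumption about the page layout), `faculty_pages` will contain duplicates and each bio will be visited repeatedly. Here is a small sketch of filtering and de-duplicating in one step while keeping the original order. `unique_faculty_links` and the example URLs are hypothetical.

```python
def unique_faculty_links(links):
    """Keep hrefs containing 'faculty/', dropping None values and duplicates."""
    hrefs = [h for h in links if h is not None and "faculty/" in h]
    # dict.fromkeys preserves first-seen order and removes repeats.
    return list(dict.fromkeys(hrefs))


# Made-up links for illustration; the slugs are not real pages.
example_links = [
    "https://journalism.utexas.edu/faculty",
    "https://journalism.utexas.edu/faculty/example-a",
    None,
    "https://journalism.utexas.edu/faculty/example-a",
    "https://journalism.utexas.edu/faculty/example-b",
]
print(unique_faculty_links(example_links))
```

Note that the listing page itself ends in "faculty" with no trailing slash, so the 'faculty/' filter excludes it.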

In [181]:
# Visit every bio page and search its page source for the keyword (case-sensitive)
profs = []
for page in faculty_pages:
    driver.get(page)
    text = driver.page_source
    if 'investigative' in text:
        # The last name is in the element with this class on each bio page
        for element in driver.find_elements(By.CLASS_NAME, 'field--name-field-last-name-faculty-bio'):
            profs.append(element.text)

In [182]:
# Last names of professors whose pages contain the keyword
profs

['Dawson', 'Pearson']