# Faculty Search

I need to propose four faculty members for my preliminary examination committee, and they have to satisfy this loose set of criteria:
- 3 of the 4 must be Bioinformatics-affiliated (DCM&B or CCMB): instructor tenure-track, or research-track at the associate level or higher
- \>=1 must be computational
- \>=1 must be biomedical
- \>=1 must be DCMB faculty
- diversity (female & non-white)

Original text: "At the same time as submitting the abstract, the student must also propose the names of four faculty members for their preliminary examination committee. Three of the four must be Bioinformatics-affiliated faculty. To clarify, this must be DCM&B or CCMB faculty who are either instructor tenure-track or if research-track, at the Associate-level or higher. At least one member must be computational and at least one biomedical, and at least one should be a DCMB faculty member."

This notebook scrapes the DCMB faculty catalogue to identify a group of professors who fit these criteria.
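
The committee rules quoted above can be written down as a small check. This is only a sketch: the `Faculty` fields (`unit`, `track`, `rank`, `computational`, `biomedical`) are hypothetical hand-made annotations, not something the scraped pages provide, and I am assuming that "three of the four" means at least three and that any tenure-track rank qualifies. The diversity goal is not encoded.

```python
from dataclasses import dataclass


@dataclass
class Faculty:
    # Hypothetical, hand-filled annotations (not scraped fields).
    name: str
    unit: str            # e.g. "DCMB", "CCMB", or another department
    track: str           # "tenure" or "research"
    rank: str            # "assistant", "associate", or "full"
    computational: bool
    biomedical: bool


def is_bioinformatics_affiliated(f):
    """DCMB/CCMB faculty: tenure-track, or research-track at associate level or higher."""
    if f.unit not in {"DCMB", "CCMB"}:
        return False
    if f.track == "tenure":
        return True
    return f.track == "research" and f.rank in {"associate", "full"}


def committee_ok(committee):
    """Check the four-member rules from the quoted text."""
    return (
        len(committee) == 4
        and sum(is_bioinformatics_affiliated(f) for f in committee) >= 3
        and any(f.computational for f in committee)
        and any(f.biomedical for f in committee)
        and any(f.unit == "DCMB" for f in committee)
    )
```

With annotations like these, candidate groups of four could be filtered through `committee_ok` once the list of names is narrowed down.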

In [88]:
# imports
import requests
from bs4 import BeautifulSoup
import cloudscraper
from collections import defaultdict
from random import choice
from time import sleep
from pickle import dump, load
import re

In [58]:
faculty_list_url = "https://medicine.umich.edu/dept/dcmb/faculty/faculty/all?page={}"  # {} is the page index, 0-4
scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance

user_agents = [
  "Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0",
  "Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0",
  "Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0",
  ]
ua = choice(user_agents)
print(ua)
page = scraper.get(faculty_list_url.format(0),headers={'User-Agent': ua})

Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0


In [98]:
faculty_to_info = defaultdict(dict)
for i in range(5):
    page = None
    while not page:
        try:
            ua = choice(user_agents)
            print(ua)
            page = scraper.get(faculty_list_url.format(i),headers={'User-Agent': ua})
        except Exception as e:
            # range() rejects float steps, so draw tenths of a second from an integer range
            naptime = choice(range(0, 50)) / 10
            print("encountered DDOS protection on {}. Sleeping for {} seconds & retrying 😴 ...".format(faculty_list_url.format(i), naptime))
            sleep(naptime)
    soup = BeautifulSoup(page.text, 'html.parser')
    for art in soup.find_all("article"):
        for head in art.find_all("h2"):
            for a in head.find_all("a"):
                print(a["href"],a.text)
                faculty_to_info[a.text]["link"] = "https://medicine.umich.edu{}".format(a["href"])

Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0
/dept/dcmb/brian-d-athey-phd Brian D. Athey, Ph.D.
/dept/dcmb/philip-c-andrews-phd Philip C Andrews, Ph.D.
/dept/dcmb/sara-aton-phd Sara Aton, Ph.D.
/dept/dcmb/kin-fai-au-phd Kin Fai Au, Ph.D.
/dept/dcmb/james-r-baker-jr-md James R.  Baker, Jr., M.D.
/dept/dcmb/veera-baladandayuthapani-phd Veera Baladandayuthapani, Ph.D.
/dept/dcmb/scott-barolo-phd Scott Barolo, Ph.D.
/dept/dcmb/anthony-bloch Anthony Bloch
/dept/dcmb/michael-boehnke-phd Michael Boehnke, Ph.D.
/dept/dcmb/victoria-booth-phd Victoria Booth, Ph.D.
/dept/dcmb/alan-boyle-phd Alan Boyle, Ph.D.
/dept/dcmb/charles-l-brooks-iii-phd Charles L. Brooks, III, Ph.D.
/dept/dcmb/charles-f-burant-md-phd Charles F.  Burant, M.D., Ph.D.
/dept/dcmb/margit-burmeister-phd Margit Burmeister, Ph.D.
/dept/dcmb/daniel-burns-jr-phd Daniel Burns, Jr, Ph.D.
/dept/dcmb/sally-camper-phd Sally Camper, Ph.D.
/dept/dcmb/sriram-chandrasekaran-phd Sriram Chandrasekaran, Ph.D.
/dept/dcmb/an

In [79]:
test_faculty = list(faculty_to_info.keys())[6]
print(test_faculty)

Scott Barolo, Ph.D.


In [99]:
for prof in faculty_to_info:
    if "bio" not in faculty_to_info[prof]:
        page = None
        for i in range(5):
            try:
                ua = choice(user_agents)
                print(ua)
                page = scraper.get(faculty_to_info[prof]["link"],headers={'User-Agent': ua})
                break
            except Exception as e:
                # range() rejects float steps, so draw tenths of a second from an integer range
                naptime = choice(range(0, 50)) / 10
                print("encountered DDOS protection on {}. Sleeping for {} seconds & retrying 😴 ...".format(faculty_to_info[prof]["link"], naptime))
                sleep(naptime)
        else:
            print("failed on {}, stopping".format(prof))
            break
        soup = BeautifulSoup(page.text, 'html.parser')
        for art in soup.find_all("article"):
            faculty_to_info[prof]["bio"] = art.text.strip()
            break

Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0,
Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0
Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0,
Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0
Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0
Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0
Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0
Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0
Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0
Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0
Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/20100101 Firefox/78.0
Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0
Mozilla/5.0 (Windo
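
Both scraping loops repeat the same try / nap / retry pattern, so it could be pulled into one helper. A minimal sketch, assuming the fetch is passed in as a callable: `fetch_with_retry`, `max_nap`, and `pause` are names I made up here, and `get` would wrap the `scraper.get` call from the cells above.

```python
from random import uniform
from time import sleep


def fetch_with_retry(get, url, attempts=5, max_nap=5.0, pause=sleep):
    """Call get(url) up to `attempts` times, napping a random interval after each failure.

    `get` is any callable that returns a response or raises, e.g.
    lambda u: scraper.get(u, headers={'User-Agent': choice(user_agents)}).
    Returns None if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return get(url)
        except Exception as exc:
            naptime = uniform(0, max_nap)  # random float number of seconds
            print("attempt {} on {} failed ({}); sleeping {:.1f}s".format(
                attempt, url, exc, naptime))
            pause(naptime)
    return None
```

Capping the attempts keeps a blocked page from looping forever, and the caller can decide what to do with a `None` result (skip the professor or stop the run).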

In [100]:
print(len(faculty_to_info))
if len(faculty_to_info) == 162:  # only cache once the full list of 162 faculty has been scraped
    with open('faculty_to_info.pkl', 'wb') as file:
        # creates the file, or overwrites it if it already exists
        dump(faculty_to_info, file)

162
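
`load` is imported at the top but never used, so here is a sketch of how the cached pickle could be read back in a later session instead of re-scraping. `save_cache` and `load_cache` are hypothetical helper names, and falling back to an empty `defaultdict(dict)` when the file is missing is my assumption about the desired behavior.

```python
import os
from collections import defaultdict
from pickle import dump, load


def save_cache(obj, path):
    """Pickle `obj` to `path`, overwriting any existing file."""
    with open(path, "wb") as fh:
        dump(obj, fh)


def load_cache(path):
    """Return the pickled object at `path`, or an empty defaultdict(dict) if it is missing."""
    if not os.path.exists(path):
        return defaultdict(dict)
    with open(path, "rb") as fh:
        return load(fh)
```

Usage would be something like `faculty_to_info = load_cache('faculty_to_info.pkl')`. Pickle files should only be loaded when they come from a trusted source, such as this notebook's own output.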


In [101]:
faculty_to_info["Michael Boehnke, Ph.D."]
# ua = choice(user_agents)
# print(ua)
# page = scraper.get(faculty_to_info["Michael Boehnke, Ph.D."]["link"],headers={'User-Agent': ua})

{'link': 'https://medicine.umich.edu/dept/dcmb/michael-boehnke-phd',
 'bio': 'Areas of Interest    \n\nDesign and Analysis of Human Gene Mapping Studies\nIdentifying Genes for Type 2 Diabetes: FUSION\nIdentifying T2D Variants by DNA Sequencing in Multiethnic Samples\nWhole Genome and Exome Sequencing for Bipolar Disorder\nThe Impact of Human Gene Knockouts in Type 2 Diabetes and Related Traits\nWhole Genome Sequencing for Schizophrenia and Bipolar Disorder in the GPC\nAccelerating Medicines Partnership: Enhancement of the Type 2 Diabetes Knowledge Portal\n \n\nWeb Sites    \n\nCenter for Statistical Genetics'}

In [114]:
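query = "chine le"  # matches "machine learning" with either case of the leading "m"

for prof in faculty_to_info:
    # .get() skips profiles whose bio was never scraped instead of raising KeyError
    if re.search(query, faculty_to_info[prof].get("bio", "")):
        print(prof)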

Veera Baladandayuthapani, Ph.D.
Johann Gagnon-Bartsch, Ph.D.
Yuanfang Guan, Ph.D.
Hui  Jiang, Ph.D.
Danai  Koutra, Ph.D.
Joonsang Lee, Ph.D.
Jie Liu, Ph.D.
Kayvan Najarian, Ph.D.
Nambi  Nallasamy, M.D.
Matthew  O'Meara, Ph.D.
Clayton  Scott, Ph.D.
Karandeep  Singh, M.D., MMSc
Chandra  Sripada, M.D., Ph.D.
Vinod  Vydiswaran , Ph.D.
Joshua  Welch, Ph.D.
Jenna Wiens, Ph.D.
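
The keyword search above could be wrapped in a function so different research areas are easy to try. A sketch only: `search_bios` is a hypothetical name, and defaulting to case-insensitive matching is my choice, so its results may differ from the case-sensitive query above.

```python
import re


def search_bios(info, pattern, flags=re.IGNORECASE):
    """Return names whose 'bio' text matches `pattern`; entries without a bio are skipped."""
    rx = re.compile(pattern, flags)
    return [name for name, rec in info.items() if rx.search(rec.get("bio", ""))]
```

For example, `search_bios(faculty_to_info, r"machine\s+learning")` would list candidates for the computational slot, and a pattern like `"clinical|disease"` could be a starting point for the biomedical one.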
