# Doctor Finder Tutorial: Webscraping and Basic Search Engine with Python
Author:  Colby Carter    
    
Last modified: 2/11/2023    
    
This how-to covers:    
1. Set up virtual environment and packages
2. Iteratively scrape key fields across webpage
3. Build a simple search engine

## 1. Set up virtual environment and packages

Using Anaconda terminal:

`# create environment for project, e.g.`    
`conda create --name doctor-finder-env python=3.9.13`    
`activate doctor-finder-env`    
`conda install pipenv` (or pip install)     
`pipenv install [package]` for each package, or update pipfile    
`pipenv shell`     
`jupyter notebook`    

Alternatively, from within the pipenv shell, register the virtual environment as a Jupyter kernel, then select it in JupyterLab:    
`python -m ipykernel install --user --name=doctor-finder-env`    

If using a requirements file, with your conda environment (edit PATH accordingly):    
`python -m venv \[PATH]\doctor-finder-env`    
`\[PATH]\doctor-finder-env\bin\pip install -r \[PATH]\requirements.txt`    
`\[PATH]\doctor-finder-env\bin\pip install ipykernel`    
`\[PATH]\doctor-finder-env\bin\python -m ipykernel install --user --name=doctor-finder-env` and select in Jupyter lab

In [1]:
# confirm python version
!python --version

Python 3.9.13


In [1]:
# confirm package versions
# !pip list
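
As a lighter alternative to `pip list`, the sketch below checks only the packages this tutorial relies on, using the standard library's `importlib.metadata`. The helper name `package_versions` is hypothetical, and the distribution names in `required` are assumed to be the PyPI names matching the imports in the next cell.

```python
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# PyPI distribution names (assumed to correspond to the imports used in this notebook)
required = ["requests", "beautifulsoup4", "selenium", "webdriver-manager", "numpy", "pandas"]
for name, ver in package_versions(required).items():
    print(f"{name:20s} {ver or 'NOT INSTALLED'}")
```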

In [2]:
# import libraries
import requests

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By

from webdriver_manager.chrome import ChromeDriverManager

import re

import time
from datetime import date

import numpy as np
import pandas as pd

import json

Resources:    
https://beautiful-soup-4.readthedocs.io/en/latest/    
https://www.kaggle.com/code/shadabhussain/web-scraping-with-python-using-beautiful-soup/notebook

## 2. Iteratively scrape key fields across webpage

### Explore Find-a-Doctor Page -- Example using Univ. of Tennessee Medical Center

In [237]:
# get beautiful soup object from URL 
finder_URL = 'https://www.utmedicalcenter.org/medical-care/medical-services/find-a-doctor/'
finder_page = requests.get(finder_URL).text
finder_soup = BeautifulSoup(finder_page, "html.parser")

Some attributes of beautiful soup object:

In [5]:
finder_page[:100]

'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/ '

In [7]:
finder_soup.name

'[document]'

In [8]:
finder_soup.title

<title>Doctor Search | UT Medical Center</title>

In [9]:
finder_soup.body.a

<a class="screenreader-text" href="#main" id="skip-nav">
      Skip to main content
    </a>

In [10]:
tag = finder_soup.a
type(tag)

bs4.element.Tag

In [11]:
tag

<a class="screenreader-text" href="#main" id="skip-nav">
      Skip to main content
    </a>

In [12]:
tag.name

'a'

In [13]:
tag['href']

'#main'

In [16]:
# look up first doctor on page (Todd-b-abel)
first_doctor = finder_soup.find('a', class_='relative')
print(first_doctor)
print()
print(first_doctor['href'])

<a class="relative" href="/find-a-doctor/todd-b-abel">
<div class="body-page-content"><article class="media media-image default">
<picture>
<source media="all and (min-width: 1200px)" srcset="/sites/default/files/styles/max_2600x2600/public/imported/doctorProfiles/Abel_Tod_300.jpg?itok=-HRGQTYt 1x" type="image/jpeg"/>
<source media="all and (min-width: 992px)" srcset="/sites/default/files/styles/max_1300x1300/public/imported/doctorProfiles/Abel_Tod_300.jpg?itok=A8UnDwl0 1x" type="image/jpeg"/>
<source media="all and (min-width: 576px)" srcset="/sites/default/files/styles/max_650x650/public/imported/doctorProfiles/Abel_Tod_300.jpg?itok=Dd171kJV 1x" type="image/jpeg"/>
<source srcset="/sites/default/files/styles/max_650x650/public/imported/doctorProfiles/Abel_Tod_300.jpg?itok=Dd171kJV 1x" type="image/jpeg"/>
<img class="w-full" src="/sites/default/files/styles/max_325x325/public/imported/doctorProfiles/Abel_Tod_300.jpg?itok=nvNGrRMh" typeof="foaf:Image"/>
</picture>
</article>
</div>
</a>

Loop over Page 1 of doctor-finder and capture all doctor URLs.

In [238]:
# page 1 doctor URLs - use in loop across all pages
def print_URLs(soup):
    """Return the profile URLs of all doctors listed on a UT find-a-doctor page."""
    URL_list = []
    doctor_soup = soup.find_all('p', class_='text-lg font-bold')
    for doctor in doctor_soup:
        name = doctor.find('a', class_="relative")
        if name is not None:  # skip entries without a profile link
            URL_list.append(name['href'])
    return URL_list

In [18]:
doctor_URLs = print_URLs(finder_soup)

In [19]:
doctor_URLs  # first 6

['/find-a-doctor/todd-b-abel',
 '/find-a-doctor/julia-abraham',
 '/find-a-doctor/wala-abusalah',
 '/find-a-doctor/john-h-acker',
 '/find-a-doctor/brittany-l-adams',
 '/find-a-doctor/theresa-adams']
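
The hrefs returned above are site-relative, so they need the site root prepended before they can be requested. A minimal sketch, assuming the site root is `https://www.utmedicalcenter.org` (the `home_url` used later in this tutorial), of resolving them with the standard library's `urljoin` rather than plain string concatenation:

```python
from urllib.parse import urljoin

home_url = "https://www.utmedicalcenter.org"
relative_urls = ["/find-a-doctor/todd-b-abel", "/find-a-doctor/julia-abraham"]

# urljoin handles a trailing slash on the base and leaves already-absolute links unchanged
absolute_urls = [urljoin(home_url, href) for href in relative_urls]
print(absolute_urls[0])
```

String concatenation works equally well here because every href starts with `/`; `urljoin` is simply more forgiving if the base or the hrefs change shape.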

### Get Doctor URLs - Selenium Chrome Driver    
Use Selenium to open a window and click through UTMC doctor pages.    
https://selenium-python.readthedocs.io/getting-started.html    
https://stackoverflow.com/questions/40555930/selenium-chromedriver-executable-needs-to-be-in-path

#### Develop 'Click Next' Step

In [239]:
# open remote Chrome window
driver = webdriver.Chrome(ChromeDriverManager().install())



In [240]:
# open find-a-doctor home page
driver.get(finder_URL)
print(finder_URL)

https://www.utmedicalcenter.org/medical-care/medical-services/find-a-doctor/


https://selenium-python.readthedocs.io/locating-elements.html#locating-elements    
https://www.testim.io/blog/selenium-click-button/

In [None]:
# element
# <div class="page-btn js-btn-next"><i class="fas fa-arrow-right"></i></div>

In [None]:
# selector
# mk-theme-container > div.pagination-wrapper > div > div.page-btn.js-btn-next

In [71]:
# next_button = driver.find_element(By.CSS_SELECTOR, 'mk-theme-container > div.pagination-wrapper > div > div.page-btn.js-btn-next')

In [None]:
# xpath
# //*[@id="mk-theme-container"]/div[5]/div/div[4]

In [8]:
doctor_URLs = []

finder_page = driver.page_source
finder_soup = BeautifulSoup(finder_page, 'html.parser')
doctor_URLs += print_URLs(finder_soup)

In [9]:
print(doctor_URLs)

[]


In [31]:
# click button for next page using xpath element - NOT IDEAL BUT WORKS
# how to locate xpath: right click on LOAD MORE button, click 'inspect', copy xpath
# next_button = driver.find_element_by_xpath('//*[@id="block-finch-content"]/div/div/div[3]/ul/li/a')
# driver.execute_script("arguments[0].click();", next_button);

In [43]:
# click button using css selector 
#<a class="button unstyle" href="?sort_by=title&amp;search_api_fulltext=&amp;page=2" title="Load more items" rel="next">Load More</a>
# each period marks a CSS class: ".button.unstyle" matches an element with both classes
next_button = driver.find_element(By.CSS_SELECTOR, ".button.unstyle")
driver.execute_script("arguments[0].click();", next_button)

In [47]:
# soup for new page
next_page = driver.page_source
next_doctor_soup = BeautifulSoup(next_page, 'html.parser')

In [48]:
doctor_URLs += print_URLs(next_doctor_soup)

In [49]:
doctor_URLs[-10:]

['/find-a-doctor/nicholas-g-anderson',
 '/find-a-doctor/charles-g-ange',
 '/find-a-doctor/carlos-angel',
 '/find-a-doctor/jonathan-w-angelle',
 '/find-a-doctor/lisa-l-angelle',
 '/find-a-doctor/kristen-e-anklowitz',
 '/find-a-doctor/andrew-j-anzeljc',
 '/find-a-doctor/joshua-d-arnold',
 '/find-a-doctor/brandon-s-asbury',
 '/find-a-doctor/raye-anne-b-ayo']

In [241]:
driver.quit()

### Loop Over All Doctors

In [244]:
# define number of iterations - this should be made flexible with stopping logic
doc_count = 1178  # as of 2/15/23
docs_per_page = 6
clicks = np.ceil((doc_count-docs_per_page)/docs_per_page)  # clicks needed after the base page, which already shows the first docs_per_page doctors
int(clicks)  # checks out

196
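
The arithmetic behind the click count can be spelled out as a small helper. This is only a sketch: the name `clicks_needed` is hypothetical, and it assumes (as the cell above does) that the first page already displays `docs_per_page` doctors and that each Load More click adds the same number.

```python
import math


def clicks_needed(doc_count, docs_per_page=6):
    """Number of Load More clicks required after the first page is displayed."""
    remaining = max(doc_count - docs_per_page, 0)  # doctors not yet shown on the first page
    return math.ceil(remaining / docs_per_page)


# (1178 - 6) / 6 = 195.33..., which rounds up to 196 clicks
print(clicks_needed(1178))
```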

In [257]:
# loop over all 196 "pages"
def doctor_URL_loop(finder_URL, doc_count, docs_per_page=6, sleep=5.0):
    """Click Load More until all doctors are shown, then return their profile URLs.

    finder_URL: starting find-a-doctor page
    doc_count: total number of doctors listed (to be replaced by a dynamic count)
    docs_per_page: number of doctors added per page load
    sleep: seconds to wait for the next results to load
    """

    # install Chrome driver, go to starting page
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get(finder_URL)

    # get number of clicks
    clicks = int(np.ceil((doc_count-docs_per_page)/docs_per_page))

    # click Load More until showing all doctors
    for i in range(clicks):
        # find and click next button; in a CSS selector each period marks a class,
        # so ".button.unstyle" matches an element with both classes "button" and "unstyle"
        next_button = driver.find_element(By.CSS_SELECTOR, ".button.unstyle")
        driver.execute_script("arguments[0].click();", next_button)

        # pause while the next batch of results loads
        time.sleep(sleep)

    # get the fully expanded page source and soup
    next_page = driver.page_source
    next_doctor_soup = BeautifulSoup(next_page, 'html.parser')

    # collect all doc URLs
    doctor_URLs = print_URLs(next_doctor_soup)
    print("Num docs:", len(doctor_URLs))
    if len(doctor_URLs)<doc_count:
        print("ERROR LOADING URLS")

    time.sleep(sleep)
    driver.quit()
    return doctor_URLs
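
The hard-coded `doc_count` could be replaced by stopping logic that clicks until the button is no longer found. The sketch below is one possible approach, not the method used in this notebook: `click_until_gone` and `max_clicks` are hypothetical names, and it assumes the site removes the Load More button once every doctor is displayed, which has not been verified here. It relies on `find_elements`, which returns an empty list when nothing matches.

```python
import time

CSS_SELECTOR = "css selector"  # the string value of selenium's By.CSS_SELECTOR


def click_until_gone(driver, selector=".button.unstyle", sleep=5.0, max_clicks=500):
    """Click the matching button until it no longer appears; return the number of clicks."""
    clicks = 0
    while clicks < max_clicks:  # safety cap in case the button never disappears
        buttons = driver.find_elements(CSS_SELECTOR, selector)
        if not buttons:
            break  # assumed signal that all results are loaded
        driver.execute_script("arguments[0].click();", buttons[0])
        clicks += 1
        time.sleep(sleep)  # give the next batch of results time to load
    return clicks
```

With an open Selenium session this would be called as `click_until_gone(driver, sleep=6.0)` before reading `driver.page_source`.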

In [258]:
finder_URL = 'https://www.utmedicalcenter.org/medical-care/medical-services/find-a-doctor/'
# doctor_URLs = doctor_URL_loop(finder_URL, doc_count=50)  # tester
doctor_URLs = doctor_URL_loop(finder_URL, doc_count=1178, sleep=6.0)



Num docs: 1178


In [259]:
doctor_URLs[:10]

['/find-a-doctor/todd-b-abel',
 '/find-a-doctor/julia-abraham',
 '/find-a-doctor/wala-abusalah',
 '/find-a-doctor/john-h-acker',
 '/find-a-doctor/brittany-l-adams',
 '/find-a-doctor/theresa-adams',
 '/find-a-doctor/lauren-c-ade',
 '/find-a-doctor/michial-adkins',
 '/find-a-doctor/fatima-ahmed',
 '/find-a-doctor/shaun-b-ajinkya']

In [260]:
doctor_URLs[-10:]

['/find-a-doctor/wing-c-yeen',
 '/find-a-doctor/jonathan-d-york',
 '/find-a-doctor/kristie-g-young',
 '/find-a-doctor/thomas-l-young',
 '/find-a-doctor/yorke-d-young',
 '/find-a-doctor/luis-e-zayas-rodriguez',
 '/find-a-doctor/rong-zeng',
 '/find-a-doctor/michael-zinckgraf',
 '/find-a-doctor/nikki-b-zite',
 '/find-a-doctor/miranda-zolman']

In [261]:
url_dict = {}
url_dict["doctor_url"] = doctor_URLs

In [262]:
url_dict["doctor_url"][5]

'/find-a-doctor/theresa-adams'

In [263]:
# from pandas.io.common import is_url
urlDF = pd.DataFrame.from_dict(url_dict, dtype='str')
urlDF.to_csv("./Data/UTMC_doctors_URLs.csv")
print(urlDF.shape)
urlDF.head()

(1178, 1)


| | doctor_url |
|---|---|
| 0 | /find-a-doctor/todd-b-abel |
| 1 | /find-a-doctor/julia-abraham |
| 2 | /find-a-doctor/wala-abusalah |
| 3 | /find-a-doctor/john-h-acker |
| 4 | /find-a-doctor/brittany-l-adams |


### Pull Doctor Info Off each Profile Page

In [264]:
home_url = 'https://www.utmedicalcenter.org'

In [265]:
urlDF = pd.read_csv("./Data/UTMC_doctors_URLs.csv").drop(["Unnamed: 0"],axis=1)
print(urlDF.shape)
urlDF.head()  # 2/15: at least one doctor dropped...

(1178, 1)


| | doctor_url |
|---|---|
| 0 | /find-a-doctor/todd-b-abel |
| 1 | /find-a-doctor/julia-abraham |
| 2 | /find-a-doctor/wala-abusalah |
| 3 | /find-a-doctor/john-h-acker |
| 4 | /find-a-doctor/brittany-l-adams |


In [10]:
# for url in urlDFmini["doctor_url"]:
doc_url = home_url + urlDF["doctor_url"][0]
print(doc_url)
doctor_page = requests.get(doc_url).text
doctor_soup = BeautifulSoup(doctor_page, "html.parser")

https://www.utmedicalcenter.org/find-a-doctor/todd-b-abel


In [19]:
# doctor name -- maybe include full title
# <h1 class="doctor-heading max-w-sm text-5xl sm:text-7xl mb-4 !leading-[1.1]">
#                     Brittany L. Adams, FNP-C                </h1>
doctor_soup.find('h1', class_="doctor-heading max-w-sm text-5xl sm:text-7xl mb-4 !leading-[1.1]").text.strip() #[:-4]

'Todd B. Abel, MD'

In [20]:
# primary specialty
# <p class="doctor-subheading uppercase mb-12 text-lg tracking-widest text-grey">Nurse Practitioner</p>
doctor_soup.find('p', class_="doctor-subheading uppercase mb-12 text-lg tracking-widest text-grey").text.strip()

'Neurological Surgery'

In [90]:
# gender
# <p class="mb-5 text-grey">
#           <span class="text-black font-bold">Gender:</span>
#           Male
#         </p>
doctor_soup.find('p', class_="mb-5 text-grey").text.split(":")[1].strip()

'Male'

In [91]:
# languages
# <p class="mb-16 capitalize text-grey">
#           <span class="text-black font-bold">Languages Spoken:</span>
#           <br>
#           English
#         </p>
doctor_soup.find('p', class_="mb-16 capitalize text-grey").text.split(":")[1].strip()

'English'

In [59]:
# short blurb
# <p class="doctor-bio mb-14 text-grey">Dr. Abel has special interests in spine surgery and tethered cord syndrome, and has completed research, presentations and published writing on these topics. </p>
doctor_soup.find('p', class_="doctor-bio mb-14 text-grey").text.strip()

'Dr. Abel has special interests in spine surgery and tethered cord syndrome, and has completed research, presentations and published writing on these topics.'

In [58]:
# about me
# <div class="mb-24 mt-4 mr-4 text-grey">
#                       Dr. Todd B. Abel is a board certified neurosurgeon and is associated with The University of Tennessee Medical Center. He is also a clinical instructor at the University of Tennessee Graduate School of Medicine.
#                   </div>
doctor_soup.find('div', class_="mb-24 mt-4 mr-4 text-grey").text.strip()

'Dr. Todd B. Abel is a board certified neurosurgeon and is associated with The University of Tennessee Medical Center. He is also a clinical instructor at the University of Tennessee Graduate School of Medicine.'

In [40]:
# Clinical focus
# <p class="capitalize text-grey max-w-xs">
#               <span class="text-black font-bold">Clinical Focus:</span>
#               <br>
#               spine trauma
#             </p>
doctor_soup.find('div', class_='md:col-span-7 grid grid-cols-1 md:grid-cols-2 gap-4')\
    .find_all("p", class_='capitalize text-grey max-w-xs')[0].text.split(':')[1].strip()

'spine trauma'

In [41]:
# Specialties
doctor_soup.find('div', class_='md:col-span-7 grid grid-cols-1 md:grid-cols-2 gap-4')\
    .find_all("p", class_='capitalize text-grey max-w-xs')[1].text.split(':')[1].strip()

'Brain and Spinal Cord Injury, Neurological Surgery'

In [52]:
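specialtyHTML = doctor_soup.find('div', class_='md:col-span-7 grid grid-cols-1 md:grid-cols-2 gap-4')\
                    .find_all("p", class_='capitalize text-grey max-w-xs')
for tag in specialtyHTML:
    print(tag)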

<p class="capitalize text-grey max-w-xs">
<span class="text-black font-bold">Clinical Focus:</span>
<br/>
              spine trauma
            </p>
<p class="capitalize text-grey max-w-xs">
<span class="text-black font-bold">Specialties:</span>
<br/>
              Brain and Spinal Cord Injury, Neurological Surgery
            </p>


In [83]:
specialtyHTML = doctor_soup.find('div', class_='md:col-span-7 grid grid-cols-1 md:grid-cols-2 gap-4')\
                    .find_all("p", class_='capitalize text-grey max-w-xs')
# print(specialtyHTML)
specialtyList = []
for tag in specialtyHTML:
    specialtyList+=[tag.text.split(':')[1].strip()]

specialtyList # only works IF specialty is last--CONFIRM

['spine trauma', 'Brain and Spinal Cord Injury, Neurological Surgery']

In [62]:
# zip code
# <p class="mb-8"><span class="block">1932 Alcoa Highway</span>Knoxville, TN 37920</p>
location = doctor_soup.find('p', class_='mb-8').get_text()
zip_code = re.search(r'\d{5}', location).group()
zip_code

'37920'

In [79]:
# phone number
# <p>
#                 Phone:
#                 865-524-1869
#               </p>
doctor_soup.find('div', class_='text-lg').find('p', class_='mb-8').find_next_sibling().text.split(":")[1].strip()

'865-524-1869'
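
Several of the fields above follow the same "Label: value" pattern, and the zip code lookup takes the first five-digit run it finds. The sketch below shows one way these text-parsing steps could be factored out; the helper names are hypothetical. Two behaviors differ from the cells above and are assumptions on my part: `value_after_label` keeps everything after the first colon (rather than only the segment up to the next colon), and `find_zip` looks for five digits following a two-letter state abbreviation so that a five-digit street number is not mistaken for the zip code.

```python
import re


def value_after_label(text):
    """Return the text after the first colon with whitespace collapsed ('' if no colon)."""
    _, sep, value = text.partition(":")
    return " ".join(value.split()) if sep else ""


def find_zip(text):
    """Return the 5-digit zip code that follows a two-letter state abbreviation, or ''."""
    match = re.search(r"\b[A-Z]{2}\s+(\d{5})\b", text)
    return match.group(1) if match else ""


print(value_after_label("Gender:\n          Male\n"))
print(find_zip("1932 Alcoa HighwayKnoxville, TN 37920"))
```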

In [266]:
# get all key fields for a given doctor
def scrape_doc_profile(doc_suffix, home_url='https://www.utmedicalcenter.org'):
    """Scrape key profile fields for one doctor; any field not found is returned as ''."""
    # get soup
    doc_url = home_url + doc_suffix
    doctor_page = requests.get(doc_url).text
    doctor_soup = BeautifulSoup(doctor_page, "html.parser")
    
    # doc name
    name_tag = doctor_soup.find('h1', class_="doctor-heading max-w-sm text-5xl sm:text-7xl mb-4 !leading-[1.1]")
    if name_tag is None:
        print(doc_url)  # profile page did not load as expected (e.g. 404 error)
        name = ""
    else:
        name = name_tag.text.strip()
    
    # profession
    profession_tag = doctor_soup.find('p', class_="doctor-subheading uppercase mb-12 text-lg tracking-widest text-grey")
    profession = "" if profession_tag is None else profession_tag.text.strip()
    
    # gender
    gender_tag = doctor_soup.find('p', class_="mb-5 text-grey")
    gender = "" if gender_tag is None else gender_tag.text.split(":")[1].strip()
    
    # languages
    languages_tag = doctor_soup.find('p', class_="mb-16 capitalize text-grey")
    languages = "" if languages_tag is None else languages_tag.text.split(":")[1].strip()
    
    # short bio
    bio_tag = doctor_soup.find('p', class_="doctor-bio mb-14 text-grey")
    bio = "" if bio_tag is None else bio_tag.text.strip()
    
    # long bio
    about_tag = doctor_soup.find('div', class_="mb-24 mt-4 mr-4 text-grey")
    about = "" if about_tag is None else about_tag.text.strip()
    
    # clinical interest & specialty (the container may be missing, e.g. on a 404 page)
    specialty_div = doctor_soup.find('div', class_='md:col-span-7 grid grid-cols-1 md:grid-cols-2 gap-4')
    specialtyHTML = [] if specialty_div is None else specialty_div.find_all("p", class_='capitalize text-grey max-w-xs')
    specialtyList = [tag.text.split(':')[1].strip() for tag in specialtyHTML]
    if len(specialtyList)==2:
        clinical = specialtyList[0]
        specialties = specialtyList[1]
    elif len(specialtyList)==1:
        clinical = ""
        specialties = specialtyList[0]
    else:
        clinical = ""
        specialties = ""
    
    # zip code
    location_tag = doctor_soup.find('p', class_='mb-8')
    zip_match = None if location_tag is None else re.search(r'\d{5}', location_tag.get_text())
    zip_code = "" if zip_match is None else zip_match.group()
    
    # phone: the element following the address paragraph, if present
    contact_div = doctor_soup.find('div', class_='text-lg')
    address_tag = None if contact_div is None else contact_div.find('p', class_='mb-8')
    phone_tag = None if address_tag is None else address_tag.find_next_sibling()
    if phone_tag is None or ":" not in phone_tag.text:
        phone = ''
    else:
        phone = phone_tag.text.split(":")[1].strip()
    
    # combine into dict
    doc_dict = {"Name": name,
                "Profession": profession,
                "Provider": "UTMC",
                "Website": doc_url,
                "Phone": phone,
                "Zip": zip_code,
                "Gender": gender,
                "Languages": languages,
                "Bio": bio,
                "About": about,
                "Clinical_Interest": clinical,
                "Specialties": specialties               
               }
    return doc_dict

In [267]:
scrape_doc_profile(urlDF["doctor_url"][0], home_url)

{'Name': 'Todd B. Abel, MD',
 'Profession': 'Neurological Surgery',
 'Provider': 'UTMC',
 'Website': 'https://www.utmedicalcenter.org/find-a-doctor/todd-b-abel',
 'Phone': '865-524-1869',
 'Zip': '37920',
 'Gender': 'Male',
 'Languages': 'English',
 'Bio': 'Dr. Abel has special interests in spine surgery and tethered cord syndrome, and has completed research, presentations and published writing on these topics.',
 'About': 'Dr. Todd B. Abel is a board certified neurosurgeon and is associated with The University of Tennessee Medical Center. He is also a clinical instructor at the University of Tennessee Graduate School of Medicine.',
 'Clinical_Interest': 'spine trauma',
 'Specialties': 'Brain and Spinal Cord Injury, Neurological Surgery'}

In [268]:
scrape_doc_profile(urlDF["doctor_url"][1], home_url)

{'Name': 'Julia A. Abraham, MD',
 'Profession': 'Family medicine physician',
 'Provider': 'UTMC',
 'Website': 'https://www.utmedicalcenter.org/find-a-doctor/julia-abraham',
 'Phone': '865-531-1300',
 'Zip': '37919',
 'Gender': 'Female',
 'Languages': 'English',
 'Bio': 'Primary care physician at Rocky Hill Family Physicians.',
 'About': 'Dr. Abraham is a Wisconsin native but grew up in Charleston, SC.  She studied microbiology at Clemson University before returning home to Charleston for medical school at the Medical University of South Carolina. Dr. Abraham then completed residency at the Mountain Area Health Education Center in Hendsersonville, NC where she fell in love with the Blue Ridge mountains. In her free time, she enjoys cooking, boating and hiking with her husband, young daughter and 2 dogs. She is also a passionate fan (and part-owner) of the Green Bay Packers.',
 'Clinical_Interest': "Preventative Care,  Chronic Disease Management,  Women's Health",
 'Specialties': 'Family

In [269]:
scrape_doc_profile(urlDF["doctor_url"][2], home_url)

{'Name': 'Wala  Abusalah, MD',
 'Profession': 'Transplant Nephrologist',
 'Provider': 'UTMC',
 'Website': 'https://www.utmedicalcenter.org/find-a-doctor/wala-abusalah',
 'Phone': '',
 'Zip': '37920',
 'Gender': 'Female',
 'Languages': 'Arabic, English',
 'Bio': 'Family, travelling and self development!',
 'About': 'A great source of happiness is seeing dialysis patients discharged from the hospital with a perfectly working kidney, leaving behind years of stress and anxiety, provoked by the long dialysis hours, fixed schedule, and dietary restrictions. The constantly evolving nature of transplant medicine manifesting in advancement in immunosuppression therapy, surgical techniques, allocation systems and exploring new areas to increase the donor pool fuels my admiration to the field.  I am a mother of two beautiful girls who I enjoy spending every moment with. I like travelling and reading during my free time.',
 'Clinical_Interest': 'Transplant medicine,  immunosuppression, chronic kid

### Apply Over URL Dataframe

In [270]:
# for loop testing
urlDFmini = urlDF.iloc[:10,]
urlDFmini

| | doctor_url |
|---|---|
| 0 | /find-a-doctor/todd-b-abel |
| 1 | /find-a-doctor/julia-abraham |
| 2 | /find-a-doctor/wala-abusalah |
| 3 | /find-a-doctor/john-h-acker |
| 4 | /find-a-doctor/brittany-l-adams |
| 5 | /find-a-doctor/theresa-adams |
| 6 | /find-a-doctor/lauren-c-ade |
| 7 | /find-a-doctor/michial-adkins |
| 8 | /find-a-doctor/fatima-ahmed |
| 9 | /find-a-doctor/shaun-b-ajinkya |


In [271]:
# work on an explicit copy so pandas does not raise SettingWithCopyWarning
urlDFmini = urlDFmini.copy()
urlDFmini["doctor_info"] = urlDFmini["doctor_url"].apply(scrape_doc_profile)


In [272]:
urlDFmini

| | doctor_url | doctor_info |
|---|---|---|
| 0 | /find-a-doctor/todd-b-abel | {'Name': 'Todd B. Abel, MD', 'Profession': 'Ne... |
| 1 | /find-a-doctor/julia-abraham | {'Name': 'Julia A. Abraham, MD', 'Profession':... |
| 2 | /find-a-doctor/wala-abusalah | {'Name': 'Wala Abusalah, MD', 'Profession': '... |
| 3 | /find-a-doctor/john-h-acker | {'Name': 'John H. Acker, MD', 'Profession': 'C... |
| 4 | /find-a-doctor/brittany-l-adams | {'Name': 'Brittany L. Adams, FNP-C', 'Professi... |
| 5 | /find-a-doctor/theresa-adams | {'Name': 'Theresa A. Adams, NP', 'Profession':... |
| 6 | /find-a-doctor/lauren-c-ade | {'Name': 'Lauren C. Ade, APRN', 'Profession': ... |
| 7 | /find-a-doctor/michial-adkins | {'Name': 'Michial A. Adkins, CRNA', 'Professio... |
| 8 | /find-a-doctor/fatima-ahmed | {'Name': 'Fatima Ahmed, DO', 'Profession': 'Fa... |
| 9 | /find-a-doctor/shaun-b-ajinkya | {'Name': 'Shaun B. Ajinkya, MD', 'Profession':... |


In [275]:
# Full Dataset
urlDF["doctor_info"] = urlDF["doctor_url"].apply(scrape_doc_profile)
urlDF.shape

(1178, 2)
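
Applying the scraper to more than a thousand profiles sends requests back to back, and a single unexpected page stops the whole run. A minimal sketch of a more forgiving loop, assuming a one-second pause is acceptable; `scrape_all`, `delay`, and the failure list are hypothetical additions, not part of the original workflow:

```python
import time


def scrape_all(suffixes, scrape_one, delay=1.0):
    """Apply scrape_one to each suffix, pausing between calls and collecting failures."""
    results, failures = [], []
    for suffix in suffixes:
        try:
            results.append(scrape_one(suffix))
        except Exception as err:  # keep going; record which profile failed and why
            failures.append((suffix, repr(err)))
        time.sleep(delay)  # be polite to the server between requests
    return results, failures
```

Here it could be called as `profiles, failed = scrape_all(urlDF["doctor_url"], scrape_doc_profile)`, after which `failed` lists any profiles worth retrying.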

In [276]:
urlDF.head()

| | doctor_url | doctor_info |
|---|---|---|
| 0 | /find-a-doctor/todd-b-abel | {'Name': 'Todd B. Abel, MD', 'Profession': 'Ne... |
| 1 | /find-a-doctor/julia-abraham | {'Name': 'Julia A. Abraham, MD', 'Profession':... |
| 2 | /find-a-doctor/wala-abusalah | {'Name': 'Wala Abusalah, MD', 'Profession': '... |
| 3 | /find-a-doctor/john-h-acker | {'Name': 'John H. Acker, MD', 'Profession': 'C... |
| 4 | /find-a-doctor/brittany-l-adams | {'Name': 'Brittany L. Adams, FNP-C', 'Professi... |


In [277]:
doc_info_list = urlDF["doctor_info"].tolist()

In [278]:
# save full json doc profile data

with open('./Data/UTMC_doctor_profiles.json', 'w') as f:
    json.dump(doc_info_list, f)
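
To confirm the file was written as expected, the saved profiles can be read back. The sketch below is illustrative only: `load_profiles` is a hypothetical helper that loads the list of profile dicts and indexes it by the `Website` field, assuming each profile URL is unique.

```python
import json


def load_profiles(path):
    """Load the saved list of doctor-profile dicts and index it by profile URL."""
    with open(path) as f:
        profiles = json.load(f)
    return {profile["Website"]: profile for profile in profiles}
```

For example, `load_profiles('./Data/UTMC_doctor_profiles.json')` should return one entry per scraped doctor if no URLs repeat.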

### Execute from Command Line    
    
The scraping workflow above, packaged as a Python class in `Scrape_UTMC_profiles.py`, can be run from the command line by passing the doctor finder URL:    
`python Scrape_UTMC_profiles.py 'https://www.utmedicalcenter.org/medical-care/medical-services/find-a-doctor/'`
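
The contents of `Scrape_UTMC_profiles.py` are not shown in this notebook, so the following is only a sketch of how a script might accept the URL as a positional argument using `argparse`. The `parse_args` helper and the optional `--sleep` flag are assumptions for illustration and may not match the actual script.

```python
import argparse


def parse_args(argv=None):
    """Parse the finder URL (required) and an optional wait time between clicks."""
    parser = argparse.ArgumentParser(
        description="Scrape doctor profiles from a find-a-doctor page."
    )
    parser.add_argument("finder_url", help="starting find-a-doctor URL")
    # hypothetical option, mirroring the sleep parameter of doctor_URL_loop
    parser.add_argument("--sleep", type=float, default=5.0,
                        help="seconds to wait after each Load More click")
    return parser.parse_args(argv)


# passing an explicit list stands in for sys.argv[1:] on the command line
args = parse_args(
    ["https://www.utmedicalcenter.org/medical-care/medical-services/find-a-doctor/", "--sleep", "6"]
)
print(args.finder_url, args.sleep)
```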

## 3. Build a simple search engine -- continued in next notebook...