## Single-Page Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all the doctors' info on page 292.
- Export the content into a CSV file called ```page_292.csv```.

In [1]:
## import libraries
import pandas as pd
import requests

In [2]:
## get url
url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action"
response = requests.get(url)
response.status_code

200

In [3]:
## use Pandas to read tables on page
tables = pd.read_html(response.text)
tables

[                                  Physician Last Name  \
 0                                               Kepes   
 1                                           Schneider   
 2                                           Uzochukwu   
 3                                         Bendelstein   
 4                                             Gregory   
 5                                           Loesevitz   
 6                                             Barnard   
 7                                         Vaccariello   
 8                                             Rankine   
 9                                               Ramos   
 10                                              Yonan   
 11                                          Francisco   
 12                                               Weis   
 13                                            Delaney   
 14                                            Sherman   
 15                                           Kollisch   
 16           

In [4]:
## let's look at the second table (the first table includes other page elements)
df = tables[1]
df

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Kepes,Ira,J.,168769,MD,09/25/2023,09/26/2023,1957
1,Schneider,Howard,Frederick,189533,MD,10/03/2023,09/26/2023,1957
2,Uzochukwu,Nzeadibenma,O.,231276,MD,09/28/2023,09/21/2023,1973
3,Bendelstein,Harold,L.,160938,MD,09/26/2023,09/19/2023,1951
4,Gregory,Joyce,,310488,MD,09/25/2023,09/18/2023,1968
5,Loesevitz,Arthur,Woldemar Roderich,182467,MD,09/21/2023,09/14/2023,1953
6,Barnard,Morris,,195220,MD,09/21/2023,09/14/2023,1963
7,Vaccariello,Charles,J.,104837,MD,09/20/2023,09/13/2023,1941
8,Rankine,Kirk,P.,294106,MD,09/20/2023,09/13/2023,1969
9,Ramos,Erwin,F.,254719,MD,09/12/2023,09/05/2023,1970


In [5]:
## use pandas to write to csv file
df.to_csv("page_292.csv", encoding = "UTF-8", index = False)

## Multipage Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all doctors whose last names begin with "P".
- Export the content into a CSV file called ```md_P.csv```.


In [27]:
## base url of site to scrape (the page number gets appended to the end)
base_url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p="
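As a quick sanity check on how each page link is formed, the sketch below appends a page number to `base_url` and parses the query string back out with the standard library. The page number 3 is only an illustrative choice, and the parameter names are copied from the URL above exactly as they appear.

```python
from urllib.parse import urlsplit, parse_qs

base_url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p="

page_number = 3  # illustrative page number, chosen only for this sketch
link = f"{base_url}{page_number}"

# parse_qs maps each query parameter to a list of its values
params = parse_qs(urlsplit(link).query)
print(params)
```

If the letter or the page number ever looks wrong in a scrape, printing `params` for the offending link is a cheap way to see what was actually requested.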

In [28]:
## Let's import the required libraries to create a delay
from random import randrange
import time

In [31]:
## Combined url timed nav with table scrape
counter = 1 ## counter to track progress
total_pages = 25 ## range() excludes its stop value, so this scrapes pages 1 through 24
df_all = [] ## list that will hold all the dataframes that are produced
busted_links = [] ## list that will hold any links that fail to load
for url_number in range(1,total_pages):
    print(f"Scraping link {counter} of {total_pages - 1}")
    counter+=1 ## increment counter
    link = f"{base_url}{url_number}"
    response = requests.get(link)
    if response.status_code == 200:
        print(f"got it...scraping page...{link}")
        df_list = pd.read_html(response.text) ## turn html tables into a list of dfs using pandas
        df_all.append(df_list[1]) ## append table in index position 1 to a list
        ## let's not forget to snooze
        snooze = randrange(5,7) ## randrange excludes 7, so this is 5 or 6 seconds
        print(f"snoozing for {snooze} seconds before scraping next link.")
        time.sleep(snooze)

    else:
        print(f"oh no! {link} returned:", response.status_code)
        busted_links.append(link) ## keep track of the link that failed
        
df_all[0:3] ## show just a few items of our list

Scraping link 1 of 24
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p=1
snoozing for 6 seconds before scraping next link.
Scraping link 2 of 24
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p=2
snoozing for 6 seconds before scraping next link.
Scraping link 3 of 24
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p=3
snoozing for 6 seconds before scraping next link.
Scraping link 4 of 24
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p=4
snoozing for 6 seconds before scraping next link.
Scraping link 5 of 24
got it...scraping page...https://apps.health.ny.gov/pubdoh/profess

[   Physician Last Name Physician First Name Physician Middle Name  \
 0                 Paal                 Adam                   NaN   
 1                 Pace               Enrico                   NaN   
 2                 Pace              Leonard                   NaN   
 3              Pacetti              Stephen                     J   
 4               Pachas               Hector                     M   
 5              Pacheco                Denny                    J.   
 6                Pacik                Peter                   NaN   
 7                Pacis            Andresito                    B.   
 8                 Pack                    A               Stephen   
 9         Packianathan             Emmanuel                   NaN   
 10        Packianathan             Emmanuel                   NaN   
 11               Padeh                Asher                   NaN   
 12               Padeh                Asher                   NaN   
 13             Padi

In [32]:
## which link broke?
busted_links

[]
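An empty list means every page came back with status 200 on this run. Had some links failed, one option would be to retry them a few times before giving up. The following is a minimal sketch, not part of the original notebook: `retry_links` is a hypothetical helper, and `fetch_status` is an assumed callable standing in for something like `requests.get(link).status_code`. A fake version is used here so the example never touches the network.

```python
def retry_links(links, fetch_status, max_attempts=3):
    """Return the links that still fail after up to max_attempts tries each."""
    still_busted = []
    for link in links:
        for _ in range(max_attempts):
            if fetch_status(link) == 200:
                break  # success, move on to the next link
        else:
            # the inner loop never hit break, so every attempt failed
            still_busted.append(link)
    return still_busted

# Fake status function (an assumption for this sketch, not a real request)
attempts = {}
def fake_status(link):
    attempts[link] = attempts.get(link, 0) + 1
    if link == "flaky":
        # fails on the first try, succeeds from the second try onward
        return 200 if attempts[link] >= 2 else 500
    return 404  # any other link never succeeds

remaining = retry_links(["flaky", "dead"], fake_status)
print(remaining)
```

In a real scrape it would make sense to sleep between attempts, just as the main loop snoozes between pages.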

In [33]:
## what does df_all hold?
df_all

[   Physician Last Name Physician First Name Physician Middle Name  \
 0                 Paal                 Adam                   NaN   
 1                 Pace               Enrico                   NaN   
 2                 Pace              Leonard                   NaN   
 3              Pacetti              Stephen                     J   
 4               Pachas               Hector                     M   
 5              Pacheco                Denny                    J.   
 6                Pacik                Peter                   NaN   
 7                Pacis            Andresito                    B.   
 8                 Pack                    A               Stephen   
 9         Packianathan             Emmanuel                   NaN   
 10        Packianathan             Emmanuel                   NaN   
 11               Padeh                Asher                   NaN   
 12               Padeh                Asher                   NaN   
 13             Padi

In [34]:
## convert to a single df rather than a list of df
df = pd.concat(df_all, ignore_index = True) 
df

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Paal,Adam,,,MD,10/30/2000,,1961.0
1,Pace,Enrico,,166026,MD,08/21/2001,,1956.0
2,Pace,Leonard,,172870,MD,01/15/2002,01/22/2002,1952.0
3,Pacetti,Stephen,J,175021,MD,04/14/2016,04/07/2016,1957.0
4,Pachas,Hector,M,095535,MD,02/11/1993,,
...,...,...,...,...,...,...,...,...
468,Puskas,John,Michael,120273,MD,10/06/2003,10/01/2003,1945.0
469,Putnam,Richard,C,76298,MD,12/11/1995,,
470,Putterman,Alan,P.,147981,DO,09/05/2023,08/29/2023,1954.0
471,Pynckel,Gary,,176711,DO,12/30/2008,12/23/2008,1952.0
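Two details of the combined table are worth a closer look: `ignore_index = True` renumbers the rows from 0 instead of repeating each page's own index, and `Year of Birth` shows up as a float (`1961.0`) because the column contains missing values. The sketch below illustrates both on two tiny stand-in tables. The names and years are made up for illustration, and the conversion to pandas' nullable `Int64` dtype is one possible cleanup, not something the notebook does.

```python
import pandas as pd

# Two made-up "pages", each with its own 0-based index like a scraped table
page_a = pd.DataFrame({"Last Name": ["Aaa", "Bbb"], "Year of Birth": [1961, 1956]})
page_b = pd.DataFrame({"Last Name": ["Ccc", "Ddd"], "Year of Birth": [float("nan"), 1950]})

stacked = pd.concat([page_a, page_b])                      # keeps 0, 1, 0, 1
combined = pd.concat([page_a, page_b], ignore_index=True)  # renumbers 0..3

print(list(stacked.index))
print(list(combined.index))

# The missing value forces the whole column to float64
print(combined["Year of Birth"].dtype)

# Optional cleanup: nullable integers keep the gap but drop the ".0"
years = combined["Year of Birth"].astype("Int64")
print(years.dtype)
```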


In [35]:
## export to csv
df.to_csv("md_P.csv", index = False, encoding = "UTF-8")