## 1. Multipage Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all doctors whose last names begin with "Z".
- Export the content into a CSV file called ```md_Z.csv```.


In [1]:
### add more cells as needed
## importing necessary libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests
from random import randrange ##  allows us to randomize numbers library
import time ## time tracker

In [2]:
## we can use the website to narrow down our search before we scrape
## this link only selects doctors that have last names that start with z
## the element we want to target is a table with class+="changeWidthalignCenter"
## switching "p=1" to "p=" so that it can be formatted, i.e. removing page number

url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p="

In [3]:
##creating function to generate links
def generateLinks(url,total_pages):
    '''
    Provide the base url with number of pages to generate links
    '''
    links = []
    for number in range(1, total_pages + 1): 
        links.append(f"{url}{number}") 
    return links

my_links = generateLinks(url,5)
my_links

['https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=3',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=4',
 'https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=5']

In [4]:
##creating snoozer
def sleepyTime():
    snoozer = randrange(30,60)
    print(f"snoozing for {snoozer} second before next scrape")
    time.sleep(snoozer)
    
##testing snoozer
#for number in range(1,5):
#    sleepyTime()

In [5]:
##creating list processor
def processList(all_dfs, file_name):
    df = pd.concat(all_dfs, ignore_index = True)
    df.to_csv(file_name, encoding = "UTF-8", index = False)
    print(f"{file_name} is in you current folder")
    return df

In [6]:
##creating scraper function
def myScraper(url,total_pages,file_name):
    '''
    Input: List links you would like to scrape, total pages, and file name.
    Output: Final df, list of broken links.
    '''
    my_links = generateLinks(url,total_pages)
    all_dfs = []
    busted_links = []

    counter = 1
    for link in my_links:
        print(f"scraping {counter} of {len(my_links)}")
        counter += 1
        print(f"scraping {link}")
        response = requests.get(link)
        if response.status_code == 200:
            df = pd.read_html(response.text)
            all_dfs.append(df[0][0:-2]) ##figuring out how to exclude the last two rows was tricky!
        else:
            print(f"{link} returned a posted link with response {response.status_code}")
            busted_links.append(link)
        if counter < len(my_links):
            sleepyTime()
        else:
            pass
    final_df = processList(all_dfs,file_name)
    print("all done!")
    return (final_df,busted_links)


In [7]:
url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p="
total_pages = 5
file_name = "md_Z.csv"
med_df = myScraper(url,total_pages,file_name)[0]
med_df

scraping 1 of 5
scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1
snoozing for 51 second before next scrape
scraping 2 of 5
scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2
snoozing for 59 second before next scrape
scraping 3 of 5
scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=3
snoozing for 34 second before next scrape
scraping 4 of 5
scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=4
scraping 5 of 5
scraping https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=5
md_Z.csv is in you current folder
all done!


Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842.0,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458.0,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699.0,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457.0,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457.0,MD,03/16/2005,03/09/2005,1941.0
...,...,...,...,...,...,...,...,...
79,Zugec,Mirko,,213710.0,MD,12/08/2020,12/01/2020,1960.0
80,Zulfacar,Mary,,130166.0,MD,10/21/2005,10/14/2005,1940.0
81,Zuniga,Dario,,123324.0,MD,05/07/2002,05/07/2002,1941.0
82,Zuttah,Silas,H,153216.0,MD,01/22/2003,06/17/2003,1953.0


## 2. Conversion function


Write a function that takes string values like ```$12.24267```, ```10,201``` and ```$12,501``` and converts them into floating point numbers like ```12.24```, ```10201.0``` and ```12501.0```

Test it out on those 3 string values.




In [19]:
### add more cells as needed
strings = ["$12.24267","10,201","$12,501"]
def StringConvert(string_list):
    floats = []
    for string in string_list:
        string = string.replace("$","")
        string = string.replace(",","_")
        floats.append(round(float(string),2))
    return floats

floats = StringConvert(strings)
floats

[12.24, 10201.0, 12501.0]