## Module 10 Assignment - Scraping a Website
* Author: brandon chiazza
* version 2.0

We will be creating a web scraper to parse a table from the Charities Bureau Website. From the website: “All 
charitable organizations operating in New York State are required by law to register and file annual financial reports 
with the Attorney General's Office. This includes any organization that conducts charitable activities, holds property 
that is used for charitable purposes, or solicits financial or other contributions.”

## Part II. Update web-scraper to iterate all results and load csv file into S3 Bucket

In [4]:
###Load modules
#!pip install webdriver-manager
#!pip install awscli
import awscli
import boto3
import selenium
import pandas as pd
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

####SCRAPE THE WEBSITE######
###call the webdriver
s=Service(ChromeDriverManager().install())
driver =  webdriver.Chrome(service=s)

#enter the url path that needs to be accessed by webdriver
driver.get('https://www.charitiesnys.com/RegistrySearch/search_charities.jsp')

# Wait for the pagination links to be present
wait = WebDriverWait(driver, 10)

#identify xpath of location to select element
inputElement = driver.find_element(By.XPATH,'//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[2]/td[2]/input[1]')
inputElement.send_keys('0')#sends the "0" as the search value for EIN 
inputElement1 = driver.find_element(By.XPATH,'//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[10]/td/input[1]').click()
sleep(4) #allow for the page to load by adding a sleep element

next_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[title='Go to page 2']")))
last_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "a[title='Go to page 7']")))
# Iterate through the pages
current_page = 1

#####CREATE DATE FRAME#####
#create empty dataframe
df = []
while True:
    # Extract the data from the current page
    table = driver.find_element(By.CSS_SELECTOR, 'table.Bordered')
    for row in table.find_elements(By.CSS_SELECTOR, 'tr'):
        # Extract the desired data from each row
        # print(row.text)
        df.append([cell.text for cell in row.find_elements(By.CSS_SELECTOR, 'td')])

    # Move to the next page
    if next_link.is_displayed() and next_link.is_enabled():
        next_link.click()
        current_page += 1
        try:
            next_link = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, f"a[title='Go to page {current_page + 1}']")))
        except:
            break
    else:
        break
# Update dataframe with header 
df = pd.DataFrame(df, columns=["Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State"])
df.dropna(inplace=True)

display(df)#let's have a look at the data before creating the CSV file and loading it into s3
# Close the browser
driver.quit()

Unnamed: 0,Organization Name,NY Reg #,EIN,Registrant Type,City,State
1,"""Forever Captain Poodaman"" The Ahmad Butler Fo...",48-07-16,843800926,NFP,PHILADELPHIA,PA
2,"""Incredibly Blessed"" Inc",49-54-61,842071758,NFP,STATEN ISLAND,NY
3,"""R"" S.U.C.C.E.S.S. Foundation Inc.",49-06-59,874012670,NFP,ROCHESTER,NY
4,"""Studio 5404"" Inc.",44-39-58,463180470,NFP,MASSAPAQUA,NY
5,"""THEY ARE HAITIAN"" FUND, INC.",20-63-46,300170128,NFP,HUDSON,NY
...,...,...,...,...,...,...
91,"POETIC Foundation, Inc.",43-15-65,800688254,NFP,STATEN ISLAND,NY
92,Riley Family Foundation (20-5973514),43-23-88,205973514,NFP,NORTHPORT,NY
93,"Sacred Music Chorale of Richmond Hill, Inc.",21-27-88,113457320,NFP,RICHMOND HILL,NY
94,S G of Rockland County Inc.,45-47-77,472138530,NFP,SPRING VALLEY,NY


In [5]:
###LOAD THE FILE INTO S3####

# Prepare CSV file name   
pathname = 'database-update-bucket-m10-assignment-konnothbiju-bitterlein' #specify location of s3:/{my-bucket}/
filename = 'charities_bureau_scrape_' # Name of your group
datetime = time.strftime("%Y%m%d%H%M%S") # Timestamp
filenames3 = f"{pathname}/{filename}{datetime}.csv" # S3 path and filename

# Connect to S3
s3 = boto3.client('s3')

# Convert DataFrame to CSV string
csv_buffer = df.to_csv(index=False)

# Upload CSV file to S3
s3.put_object(Bucket=pathname, Key=f"{filename}{datetime}.csv", Body=csv_buffer)

# Print success message
print("Successfully uploaded file to location: s3://{}/{}".format(pathname, filenames3))


Successfully uploaded file to location: s3://database-update-bucket-m10-assignment-konnothbiju-bitterlein/database-update-bucket-m10-assignment-konnothbiju-bitterlein/charities_bureau_scrape_20240409132635.csv


## References
* https://www.programiz.com/python-programming/working-csv-files
* https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket
* https://realpython.com/python-boto3-aws-s3/
* https://robertorocha.info/setting-up-a-selenium-web-scraper-on-aws-lambda-with-python/ 