# Advanced Web Scraping and Data Integration with AWS S3

Scraping, compiling, and storing large web datasets efficiently is a core data science skill. This guide extends a basic web scraping script so that it navigates through every results page of a website and gathers the data into a single structured dataset. It then shows how to upload the compiled dataset to Amazon S3 (Simple Storage Service), so that the data is stored reliably and remains accessible for further analysis or reporting.

## Detailed Explanation
### Enhanced Web Scraping
The process begins by modifying an existing web scraping script to handle pagination. A browser automation tool (Selenium, driving Chrome) submits the site's search form, reads the results table on each page, and follows the "Next" link until no further page exists. Each table row yields the same set of fields: organization name, NY registration number, EIN, registrant type, city, and state.
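
The pagination logic reduces to a small loop: read the current page, try to advance, and stop when no further page exists. The sketch below isolates that loop from Selenium so it can run anywhere. `iter_pages`, `read_rows`, and `go_to_next_page` are hypothetical names; in a real script the two callables would wrap the `find_elements` call and the click on the "Next" link. The `max_pages` cap is also an assumption, added as a guard against a link that never disappears.

```python
from typing import Callable, Iterator, List

Row = List[str]


def iter_pages(
    read_rows: Callable[[], List[Row]],
    go_to_next_page: Callable[[], bool],
    max_pages: int = 1000,
) -> Iterator[List[Row]]:
    """Yield the rows of each page until no next page is available."""
    for _ in range(max_pages):
        yield read_rows()
        if not go_to_next_page():
            return


# Stand-in for a paginated site: two pages of table rows (made-up data).
fake_pages = [[["Org A", "NY"], ["Org B", "PA"]], [["Org C", "NY"]]]
position = {"index": 0}


def read_rows() -> List[Row]:
    # Return the rows of the page the fake browser is currently on
    return fake_pages[position["index"]]


def go_to_next_page() -> bool:
    # Advance if another page exists and report whether navigation happened
    if position["index"] + 1 < len(fake_pages):
        position["index"] += 1
        return True
    return False


all_rows = [row for page in iter_pages(read_rows, go_to_next_page) for row in page]
print(all_rows)
```

Keeping navigation behind two callables makes the stopping condition easy to test without opening a browser.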

### Data Compilation
As the script scrapes each page, it appends every table row to a Python list as a dictionary keyed by column name. After the last page, the list is converted into a single pandas dataframe that holds all collected records and can be filtered, cleaned, and analyzed in one place.
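
As a minimal sketch of this aggregation step, the rows can be kept as plain dictionaries and converted in one call once scraping has finished. The sample rows below are invented for illustration; only the column names come from the script.

```python
import pandas as pd

# Invented sample rows; in the real script each dict comes from one table row.
page_1 = [
    {"Organization Name": "Example Fund", "City": "ALBANY", "State": "NY"},
    {"Organization Name": "Sample Trust", "City": "NEWARK", "State": "NJ"},
]
page_2 = [
    {"Organization Name": "Demo Society", "City": "BUFFALO", "State": "NY"},
]

rows = []
for page in (page_1, page_2):
    rows.extend(page)  # accumulate plain dicts while scraping

# One constructor call builds the whole frame with a fresh 0..n-1 index.
df = pd.DataFrame(rows)
print(df.shape)
```

Building the frame once at the end avoids creating a one-row dataframe per record and concatenating them, which gets slow as the number of rows grows.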

### Secure Data Upload to AWS S3
Once the data is compiled, the next step is to upload it to Amazon S3. The script checks whether the target S3 bucket (the top-level container that holds objects in S3) exists, creates it if it does not, and then writes the compiled dataset into the bucket as a CSV object. Requests are authenticated with an AWS access key, and S3 provides durable storage for large datasets along with controls for availability, access, and compliance.
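
The check-then-create logic can be sketched as a small helper. It is written against any object that exposes `list_buckets` and `create_bucket` with the same shape as the boto3 S3 client, so the stand-in `FakeS3Client` (hypothetical, for illustration only) lets the sketch run without AWS credentials. The helper name `ensure_bucket`, the `region` parameter, and the bucket name are assumptions. Outside `us-east-1`, S3 expects a `CreateBucketConfiguration` with a `LocationConstraint`.

```python
def ensure_bucket(client, bucket_name, region="us-east-1"):
    """Create bucket_name if the account does not already own it.

    Returns True if the bucket was created, False if it already existed.
    """
    existing = [b["Name"] for b in client.list_buckets()["Buckets"]]
    if bucket_name in existing:
        return False
    if region == "us-east-1":
        # us-east-1 is the default and is not passed as a location constraint
        client.create_bucket(Bucket=bucket_name)
    else:
        client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return True


class FakeS3Client:
    """In-memory stand-in that mimics the two client calls used above."""

    def __init__(self):
        self.names = []

    def list_buckets(self):
        return {"Buckets": [{"Name": name} for name in self.names]}

    def create_bucket(self, Bucket, CreateBucketConfiguration=None):
        self.names.append(Bucket)
        return {}


client = FakeS3Client()
created_first = ensure_bucket(client, "example-scrape-bucket", region="eu-west-1")
created_again = ensure_bucket(client, "example-scrape-bucket", region="eu-west-1")
print(created_first, created_again)
```

In real use, `client` would be `boto3.client('s3')`. `list_buckets` only reports buckets owned by the calling account, and bucket names are shared globally, so creation can still fail if another account already holds the name.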

### Automating and Monitoring Uploads
The final part of the script automates the upload and confirms that the data was stored. It serializes the dataframe to CSV in memory, writes it to the bucket, and prints a confirmation once the upload call returns without error. The script does not verify the transferred bytes itself; a successful `put_object` call is treated as confirmation. This automation matters for large datasets or regularly updated sources, where manual execution would be impractical.
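
Integrity of the transfer can also be checked explicitly. As a sketch, and as an assumption rather than something the script does, the request can carry an MD5 digest of the payload: `put_object` accepts a `ContentMD5` argument (the base64-encoded MD5 of the body), and S3 rejects the upload if the received bytes do not match. `upload_with_md5` and `RecordingClient` are hypothetical names; the stand-in client lets the sketch run without AWS access.

```python
import base64
import hashlib


def upload_with_md5(client, bucket_name, key, text):
    """Upload text as UTF-8 and let S3 verify it against an MD5 digest."""
    body = text.encode("utf-8")
    digest = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    response = client.put_object(
        Bucket=bucket_name, Key=key, Body=body, ContentMD5=digest
    )
    # Treat HTTP 200 as confirmation that the object was stored
    return response["ResponseMetadata"]["HTTPStatusCode"] == 200


class RecordingClient:
    """Stand-in that records the call and answers like a successful upload."""

    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)
        return {"ResponseMetadata": {"HTTPStatusCode": 200}}


client = RecordingClient()
ok = upload_with_md5(client, "example-scrape-bucket", "data.csv", "a,b\n1,2\n")
print(ok, client.calls[0]["ContentMD5"])
```

With a real boto3 client, a digest mismatch surfaces as a `ClientError`, so the confirmation message is only reached when S3 accepted the bytes it was sent.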

## Practical Applications
Combining paginated web scraping with Amazon S3 storage gives data scientists and analysts a repeatable way to gather data from web sources and keep it ready for analysis, machine learning models, or reporting. The approach is particularly useful for projects that need regular updates from dynamic web sources, such as daily financial data, social media statistics, or charitable organization registries.

This workflow saves time and makes the data more reliable and accessible, which supports more sophisticated and timely analysis.

In [1]:
# Load modules
import time

import boto3
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Create a Chrome driver service that points at the local chromedriver binary
chrome_driver_service = Service('C:/Users/RAI/Desktop/Information-Architectures/chromedriver.exe')

# Create a new Chrome browser instance; webdriver.Chrome starts the service itself,
# so no explicit chrome_driver_service.start() call is needed
browser = webdriver.Chrome(service=chrome_driver_service)

# enter the URL path that needs to be accessed by webdriver
browser.get('https://www.charitiesnys.com/RegistrySearch/search_charities.jsp')

# Locate the form input by its XPath and type the search term '0'
inputElement = browser.find_element(By.XPATH, "/html/body/div/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[2]/td[2]/input[1]")
inputElement.send_keys('0')

# Click the form control that submits the search (click() returns None, so nothing is assigned)
browser.find_element(By.XPATH, "/html/body/div/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[10]/td/input[1]").click()
time.sleep(4)  # allow the results page to load

# Identify the table to scrape
table = browser.find_element(By.CSS_SELECTOR, 'table.Bordered')
time.sleep(1)

# Create an empty list that collects one dict per table row
df_list = []

# Loop through pages of results
while True:
    # Loop through the rows of the current results table
    for row in table.find_elements(By.CSS_SELECTOR, 'tr'):
        cols = [cell.text for cell in row.find_elements(By.CSS_SELECTOR, 'td')]
        if len(cols) >= 6:  # skip rows that do not have all six data cells
            df_list.append({"Organization Name": cols[0], "NY Reg #": cols[1], "EIN": cols[2], "Registrant Type": cols[3], "City": cols[4], "State": cols[5]})

    # Check if there is another page of results
    next_button = browser.find_elements(By.XPATH, "//a[contains(text(),'Next')]")
    if len(next_button) > 0:
        next_button[0].click()
        time.sleep(4)  # allow for the page to load by adding a sleep element
        table = browser.find_element(By.CSS_SELECTOR, 'table.Bordered')
    else:
        break

# Build a single dataframe from the list of row dicts
df = pd.DataFrame(df_list)

# Display the scraped data
display(df)

,Organization Name,NY Reg #,EIN,Registrant Type,City,State
0,"""Forever Captain Poodaman"" The Ahmad Butler Fo...",48-07-16,843800926,NFP,PHILADELPHIA,PA
1,"""R"" S.U.C.C.E.S.S. Foundation Inc.",49-06-59,874012670,NFP,ROCHESTER,NY
2,"""Studio 5404"" Inc.",44-39-58,463180470,NFP,MASSAPAQUA,NY
3,"""THEY ARE HAITIAN"" FUND, INC.",20-63-46,300170128,NFP,HUDSON,NY
4,"""Y"" Dive, Inc.",48-45-01,854252095,NFP,SAINT ALBANS,NY
...,...,...,...,...,...,...
95,University of Virginia Health Foundtion,40-44-88,412097394,NFP,CHARLOTTESVILLE,VA
96,Violin Player,41-40-19,270773158,NFP,EAST AMHERST,NY
97,"William A. Epps Community Center, Inc.",40-91-11,861074714,NFP,STATEN ISLAND,NY
98,WORLD SOCIETY OF CZESTOCHOWA JEWS AND THEIR DE...,40-46-49,205101779,NFP,NEW YORK,NY
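
Mapping cells to column names by position raises an `IndexError` when a row has fewer cells than expected, for example a header or navigation row. A hedged alternative, not part of the original script, is a small helper that pairs names with cells and skips rows of the wrong width. `COLUMNS` mirrors the headers shown above, `row_to_record` is a hypothetical name, and the sample cells are made up.

```python
COLUMNS = ["Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State"]


def row_to_record(cells):
    """Return a dict for a full-width row, or None for any other row."""
    if len(cells) != len(COLUMNS):
        return None
    return dict(zip(COLUMNS, (cell.strip() for cell in cells)))


# Made-up cell lists standing in for the text of each <td> in a row.
raw_rows = [
    [],  # header row: <th> cells only, so no <td> text
    ["Example Fund", "00-00-00", "000000000", "NFP", "ALBANY ", "NY"],
    ["Next"],  # navigation row with a single cell
]
records = [r for r in map(row_to_record, raw_rows) if r is not None]
print(records)
```

Stripping whitespace at this point also keeps stray padding out of the dataframe.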


In [2]:
import boto3
from botocore.exceptions import ClientError

# Set the name of the bucket to create
bucket_name = 'webscraperm10part2'

# Replace the placeholders with your own credentials and never commit real keys.
# If both key arguments are omitted, boto3 reads credentials from the
# environment or from ~/.aws/credentials.
aws_s3_client = boto3.client('s3',
          aws_access_key_id = 'enter your access key id here',
          aws_secret_access_key = 'enter your secret access key here')

response = aws_s3_client.list_buckets()
bucket_exist = False

for bucket in response['Buckets']:
    if bucket['Name'] == bucket_name:
        bucket_exist = True
        break

if bucket_exist:
    print("The bucket exists")
else:
    print("The bucket does not exist")

# Create the bucket if it doesn't exist
if not bucket_exist:
    try:
        aws_s3_client.create_bucket(Bucket=bucket_name)
        print(f"{bucket_name} bucket has been created on AWS S3")
    except ClientError as e:
        # Typical causes: the name is already taken globally or permissions are missing
        print(e)
        print(f"{bucket_name} cannot be created on S3")

The bucket does not exist
webscraperm10part2 bucket has been created on AWS S3


In [6]:
from io import StringIO

def upload_s3(df, key):
    """Serialize df to CSV in memory and store it in the bucket under key."""
    csv_buffer = StringIO()
    # pandas 1.5 renamed line_terminator to lineterminator; pandas 2.0 removed the old name
    df.to_csv(csv_buffer, header=True, lineterminator='\n')
    aws_s3_client.put_object(Bucket=bucket_name, Body=csv_buffer.getvalue(), Key=key)

# Build a timestamped object key so repeated runs do not overwrite each other
filename = 'charities_bureau_scrape_'  # filename prefix
timestamp = time.strftime("%Y%m%d%H%M%S")  # timestamp of this run
filenames3 = "%s%s.csv" % (filename, timestamp)  # object key of the CSV file in S3

# Check that the bucket exists before uploading
response = aws_s3_client.list_buckets()
buckets = [bucket['Name'] for bucket in response['Buckets']]
if bucket_name not in buckets:
    print(f"{bucket_name} bucket does not exist.")
else:
    print("Uploading Data")
    upload_s3(df, filenames3)
    print("Data uploaded successfully")
    # Print the success message with the key that was actually written
    print("Successfully uploaded file to location:" + filenames3)

Uploading Data
Data uploaded successfully
Successfully uploaded file to location:charities_bureau_scrape_20230412104525.csv
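
The timestamped object name can come from a small helper, so that the same value is used for the upload and for the log message. This is a sketch: `build_key` and its `when` parameter are hypothetical names introduced here, and the prefix is the one used in the cell above.

```python
import time


def build_key(prefix, when=None):
    """Return '<prefix><YYYYmmddHHMMSS>.csv' for a struct_time (default: now)."""
    when = time.localtime() if when is None else when
    return f"{prefix}{time.strftime('%Y%m%d%H%M%S', when)}.csv"


# A fixed time makes the result reproducible: 12 April 2023, 10:45:25.
fixed = time.strptime("2023-04-12 10:45:25", "%Y-%m-%d %H:%M:%S")
key = build_key("charities_bureau_scrape_", fixed)
print(key)
```

Passing the time in as an argument keeps the helper deterministic in tests, while the default still stamps each real run with the current local time.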
