# Enhancing Data Acquisition and Management with Advanced Web Scraping and AWS S3 Integration

Collecting data from online sources and storing it reliably are routine tasks in data work. This tutorial extends a basic web scraping tool so that it can navigate a website and extract data from multiple pages, then shows how to load that data into Amazon S3 for storage and management.

Detailed Explanation
Web Scraping Enhancement
The first task is to upgrade an existing web scraping script. The script drives a web browser programmatically, which lets it navigate a website and collect data from multiple pages automatically. This is particularly useful for paginated websites, where the data is spread across several pages. Browser automation tools load each page, extract the required information, and handle the transition to the next page.
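
The page-by-page loop described above might look like the following minimal sketch. The helper name `collect_rows`, the `max_pages` and `pause` parameters, and the existence of a clickable "next page" element are assumptions made for illustration; the notebook cells further down scrape only the first results page. Locators are passed as `(by, value)` tuples, and the string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR`.

```python
import time


def collect_rows(driver, table_locator, next_locator, max_pages=50, pause=2.0):
    """Collect the cell text of every data row from a paginated results table.

    `driver` is a Selenium WebDriver; both locators are (by, value) tuples.
    """
    rows = []
    for _ in range(max_pages):  # hard cap so a broken "next" link cannot loop forever
        # Re-locate the table on every page: old element references go stale
        # once the browser has navigated.
        table = driver.find_element(*table_locator)
        for tr in table.find_elements("css selector", "tr"):
            cells = [td.text for td in tr.find_elements("css selector", "td")]
            if cells:  # skip rows without <td> cells, such as a header row
                rows.append(cells)

        # find_elements returns an empty list (no exception) when nothing matches,
        # which is how the last page is detected here.
        next_links = driver.find_elements(*next_locator)
        if not next_links:
            break
        next_links[0].click()
        time.sleep(pause)  # crude wait for the next page to load
    return rows
```

With the `browser` object from the notebook this could be called as `collect_rows(browser, (By.CSS_SELECTOR, 'table.Bordered'), (By.LINK_TEXT, 'Next'))`, where the link text `'Next'` is a guess that would need to be checked against the real page.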

Data Compilation and Structuring
As each page is scraped, its rows are held in memory and then compiled into a single dataframe. This dataframe is the unified structure for all the collected data, which makes it easier to process and analyze. Its columns hold the attributes of each record: organization name, NY registration number, EIN, registrant type, city, and state.
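
As a rough sketch of this compilation step, the function below flattens per-page row lists into one dataframe. The column names are the ones used later in the notebook; the function name `rows_to_frame`, the filtering of malformed rows, and the de-duplication are assumptions added for illustration, not part of the original script.

```python
import pandas as pd

COLUMNS = ["Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State"]


def rows_to_frame(pages):
    """Combine per-page lists of rows into a single dataframe.

    `pages` is a list with one entry per scraped page; each entry is a list
    of rows, and each row is a list of cell strings.
    """
    # Keep only rows that have exactly one value per expected column
    # (this drops empty header rows and any malformed rows).
    rows = [row for page in pages for row in page if len(row) == len(COLUMNS)]
    frame = pd.DataFrame(rows, columns=COLUMNS)
    # Assumption: the same record may appear on two pages, so drop exact duplicates.
    return frame.drop_duplicates().reset_index(drop=True)
```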

Secure and Efficient Data Upload to AWS S3
Once the data is compiled, the dataset is uploaded to Amazon Web Services (AWS) S3, a scalable object storage service. This section covers checking whether the target bucket already exists and creating it if it does not. Requests are authenticated with an AWS access key pair, and errors returned by S3 are caught and reported rather than silently ignored.
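
One possible way to express the "check, then create" step is sketched below. It uses `head_bucket` on the single bucket of interest instead of listing every bucket in the account, which is a different technique from the notebook cell further down. The function name `ensure_bucket` and the `region` default are assumptions for illustration.

```python
def ensure_bucket(s3_client, bucket_name, region="us-east-1"):
    """Make sure `bucket_name` exists.

    Returns True if the bucket was already there, False if it was created.
    `s3_client` is a boto3 S3 client, e.g. boto3.client("s3").
    """
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return True
    except s3_client.exceptions.ClientError as err:
        code = err.response["Error"]["Code"]
        if code not in ("404", "NoSuchBucket"):
            # For example 403: the name is taken by a bucket we cannot access.
            raise

    if region == "us-east-1":
        # us-east-1 is the default location and must not be passed explicitly.
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return False
```

With the client from the notebook this would be called as `ensure_bucket(aws_s3_client, bucket_name)`. Errors other than "not found" are re-raised on purpose, so a permissions problem is not mistaken for a missing bucket.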

Automation and Reliability
The script is designed to automate the entire process, from data scraping to uploading, minimizing manual intervention and potential errors. This automation is crucial for tasks requiring regular data updates or handling large datasets. The integration with AWS S3 also provides a reliable and secure storage solution, making the data easily accessible for future analysis or backup purposes.
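
For unattended runs, one common way to tolerate transient failures (a slow page load, a dropped connection during upload) is to retry a step a few times before giving up. The helper below is a generic sketch, not something the original script does; the name `with_retries` and its parameters are hypothetical.

```python
import time


def with_retries(task, attempts=3, base_delay=1.0, retry_on=(Exception,)):
    """Call `task()` until it succeeds or `attempts` tries have been used.

    The delay doubles after every failed try (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller still sees the real error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A step from the notebook could then be wrapped as `with_retries(lambda: upload_s3(df, 'scrap_data.csv'))`.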

Practical Implications
This approach is invaluable for organizations that rely on continuous data updates from various sources. It can be utilized for market research, regulatory compliance, customer behavior analysis, and more. The technique ensures that data-driven decisions are based on the most current and comprehensive data available, thereby enhancing operational efficiency and strategic planning.

In summary, this method streamlines data acquisition through browser-driven web scraping and uses cloud storage to keep the results available for later analysis.

In [32]:
### Load modules
import time
from time import sleep

import boto3
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

#### SCRAPE THE WEBSITE ####
# Point Selenium at the local ChromeDriver executable
chrome_driver_service = Service('C:/Users/RAI/Desktop/Information-Architectures/chromedriver.exe')

# Create a Chrome browser instance; webdriver.Chrome starts the service itself,
# so no explicit chrome_driver_service.start() call is needed
browser = webdriver.Chrome(service=chrome_driver_service)

# Open the page that the webdriver should access
browser.get('https://www.charitiesnys.com/RegistrySearch/search_charities.jsp')

# Locate the form field by its XPath and type the search value
input_element = browser.find_element(By.XPATH, "/html/body/div/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[2]/td[2]/input[1]")
input_element.send_keys('0')

# Locate the form's button by its XPath and click it
# (click() returns None, so its result is not stored)
search_button = browser.find_element(By.XPATH, "/html/body/div/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[10]/td/input[1]")
search_button.click()
sleep(4)  # give the results page time to load

# Identify the table to scrape
table = browser.find_element(By.CSS_SELECTOR, 'table.Bordered')

print(table)

<selenium.webdriver.remote.webelement.WebElement (session="94e0fb67d9b6ec8edd7d66a01915af7f", element="b748c026-1a22-4a32-8995-ddefba1c67c8")>


In [33]:
##### CREATE DATA FRAME #####
# Collect the rows in a plain list first
rows = []

# Loop through the table rows and keep the text of every <td> cell.
# A row without <td> cells yields an empty list, which appears as a blank row in the output.
for row in table.find_elements(By.CSS_SELECTOR, 'tr'):
    rows.append([cell.text for cell in row.find_elements(By.CSS_SELECTOR, 'td')])

# Build the dataframe and attach the column headers
df = pd.DataFrame(rows, columns=["Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State"])
display(df)

Unnamed: 0,Organization Name,NY Reg #,EIN,Registrant Type,City,State
0,,,,,,
1,"""Forever Captain Poodaman"" The Ahmad Butler Fo...",48-07-16,843800926.0,NFP,PHILADELPHIA,PA
2,"""R"" S.U.C.C.E.S.S. Foundation Inc.",49-06-59,874012670.0,NFP,ROCHESTER,NY
3,"""Studio 5404"" Inc.",44-39-58,463180470.0,NFP,MASSAPAQUA,NY
4,"""THEY ARE HAITIAN"" FUND, INC.",20-63-46,300170128.0,NFP,HUDSON,NY
5,"""Y"" Dive, Inc.",48-45-01,854252095.0,NFP,SAINT ALBANS,NY
6,(ASMA) American Syrian Multicultural Associati...,42-84-63,273130182.0,NFP,BROOKLYN,NY
7,#FeedHamburg,48-37-35,854150318.0,NFP,HAMBURG,NY
8,#HicksStrong Inc.,48-10-48,842612081.0,NFP,CLIFTON PARK,NY
9,#WalkAway Foundation,47-15-80,832820906.0,NFP,CARLSBAD,CA


In [39]:
import boto3
from botocore.exceptions import ClientError

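# Set the name of the bucket to create.
# S3 bucket names may contain only lowercase letters, digits, hyphens and dots; underscores are not allowed.
bucket_name = 'webscraperm10'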

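# Create the S3 client. Replace the placeholders with your own credentials,
# or omit both arguments to let boto3 read them from the environment or ~/.aws/credentials.
aws_s3_client = boto3.client(
    's3',
    aws_access_key_id='enter your access key id here',
    aws_secret_access_key='enter your secret access key here',
)

# List the buckets in the account and check whether ours is among them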
response = aws_s3_client.list_buckets()
bucket_exist = False

for bucket in response['Buckets']:
    if bucket['Name'] == bucket_name:
        bucket_exist = True
        break

if bucket_exist:
    print("The bucket exists")
else:
    print("The bucket does not exist")

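# Create the bucket if it doesn't exist.
# Without a CreateBucketConfiguration this call succeeds only when the client is
# configured for us-east-1; other regions need
# CreateBucketConfiguration={'LocationConstraint': region}.
if not bucket_exist:
    try:
        aws_s3_client.create_bucket(Bucket=bucket_name)
        print(f"{bucket_name} bucket has been created on AWS S3")
    except ClientError as e:
        # Error returned by S3, e.g. the name is already taken
        print(e)
        print(f"{bucket_name} cannot be created on S3")
    except Exception as e:
        # Anything else, e.g. a connection problem
        print(e)
        print(f"{bucket_name} cannot be created on S3")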


The bucket does not exist
webscraperm10 bucket has been created on AWS S3


In [41]:
import time
from io import StringIO

def upload_s3(df, key):
    """Serialize the dataframe to CSV in memory and upload it to the bucket under `key`."""
    csv_buffer = StringIO()
    # `lineterminator` replaced `line_terminator` in pandas 1.5
    df.to_csv(csv_buffer, header=True, lineterminator='\n')
    aws_s3_client.put_object(Bucket=bucket_name, Body=csv_buffer.getvalue(), Key=key)

# Build a timestamped object key so that repeated runs do not overwrite each other
filename = 'charities_bureau_scrape_'  # file name prefix, e.g. the name of your group
timestamp = time.strftime("%Y%m%d%H%M%S")
filenames3 = f"{filename}{timestamp}.csv"  # object key of the csv file

# Check that the bucket exists before uploading
response = aws_s3_client.list_buckets()
buckets = [bucket['Name'] for bucket in response['Buckets']]
if bucket_name not in buckets:
    print(f"{bucket_name} bucket does not exist.")
else:
    print("Uploading Data")
    upload_s3(df, filenames3)
    print("Data uploaded successfully ")
    # Report the object key that was actually written
    print("Successfully uploaded file to location:" + filenames3)

Uploading Data
Data uploaded successfully 
Successfully uploaded file to location:s3:/NAME_OF_YOUR_S3_BUCKET/charities_bureau_scrape_20230412003147.csv
