#  Web Scraping Assignment by Taiyo.AI

* Name: Vishav Bhaat Pal
* Course: M.Tech in Computational and Data Science
* College: National Institute of Technology Karnataka
* Roll No: 222CD032

### Petaluma City

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the URL of the webpage you want to scrape
url = "https://cityofpetaluma.org/planning-projects/"

# Send a GET request to the webpage
response = requests.get(url)

# Parse the HTML content of the webpage
soup = BeautifulSoup(response.content, "html.parser")

# Find the <div> element with specific classes that contain the links
div = soup.find("div", class_="headingBox panel-widget-style panel-widget-style-for-57679-3-0-1")

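# Find all <a> elements within the <div>; stop early if the container is missing
if div is None:
    raise RuntimeError("Project list container not found; the page layout may have changed")
links = div.find_all("a")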

# Initialize an empty list to store the data
data_list = []


# Iterate over the links (start from the second link)
for link in links[1:]:
    # Get the URL of the webpage
    webpage_url = link['href']

    # Send a GET request to the webpage
    webpage_response = requests.get(webpage_url)

    # Parse the HTML content of the webpage
    webpage_soup = BeautifulSoup(webpage_response.content, "html.parser")

    # Extract data by searching for keywords
    data = {
        'Project Name': None,
        'Address': None,
        'APN': None,
        'Applicant Name': None,
        'Date of Decision': None,
        'Project Description': None,
        'Location Link': None,
        'Posted Date': None,  # Adding Posted Date here
        'URL': webpage_url  # Adding URL here
    }

    # Find all <strong> tags and extract the text next to them
    for strong_tag in webpage_soup.find_all('strong'):
        key = strong_tag.get_text(strip=True).replace(":", "")

        # Extract the value considering different possible structures
        value = ""
        if strong_tag.next_sibling:
            if isinstance(strong_tag.next_sibling, str):
                value = strong_tag.next_sibling.strip()
            elif strong_tag.next_sibling.name:  # If the next sibling is a tag
                value = strong_tag.next_sibling.get_text(strip=True)
        elif strong_tag.find_next_sibling():  # If next_sibling is not directly available
            value = strong_tag.find_next_sibling().get_text(strip=True)

        if key in data:
            data[key] = value

    # Find the anchor tag with the text "Map It"
    map_link = webpage_soup.find('a', string="Map It")

    # Extract the href attribute if the link is found
    location_link = map_link['href'] if map_link else None

    # Add Location Link to the data dictionary
    data['Location Link'] = location_link

    # Find the <p> tag with class "text-muted" and extract the text
    posted_date_tag = webpage_soup.find('p', class_='text-muted')
    posted_date = posted_date_tag.get_text(strip=True) if posted_date_tag else None

    # Add Posted Date to the data dictionary
    data['Posted Date'] = posted_date

    # Append the data to the list
    data_list.append(data)

# Create a DataFrame from the list of dictionaries
df = pd.DataFrame(data_list)

# Reorder columns to place Location Link after Address
df = df[['Project Name', 'Address', 'Location Link', 'APN', 'Applicant Name', 'Date of Decision', 'Project Description', 'Posted Date', 'URL']]

# Set display options to limit the width of the URL column
pd.set_option('display.max_colwidth', 100)

In [7]:
# Function to extract coordinates from a link
def extract_coordinates(link):
    if link:
        # Check if the link contains coordinates
        if '@' in link:
            # Split the link by "@"
            parts = link.split("@")
            # Extract the part containing the coordinates
            coordinates_part = parts[-1]
            # Split the coordinates part by "," to separate latitude and longitude
            coordinates = coordinates_part.split(",")[0:2]
            # Join the latitude and longitude with a comma
            return ','.join(coordinates)
        else:
            return None
    else:
        return None

# Apply the function to the "Location Link" column
df['Coordinates'] = df['Location Link'].apply(extract_coordinates)
df.drop(['Date of Decision', 'Location Link'], axis=1, inplace=True)
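
The string splitting in `extract_coordinates` can also be written as a single regular expression that returns numeric values. This is a minimal sketch rather than the method used above. It assumes the "Map It" links follow the pattern `.../@<lat>,<lng>,<zoom>z`; the helper name `parse_lat_lng` and the sample URL are hypothetical.

```python
import re

# Matches "@<lat>,<lng>" where both numbers are optionally signed decimals.
COORD_RE = re.compile(r"@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)")

def parse_lat_lng(link):
    """Return (lat, lng) as floats, or None if the link carries no coordinates."""
    if not link:
        return None
    match = COORD_RE.search(link)
    if match is None:
        return None
    lat, lng = float(match.group(1)), float(match.group(2))
    # Reject values outside the valid latitude/longitude ranges.
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        return None
    return lat, lng

# Hypothetical link in the assumed format.
sample = "https://www.google.com/maps/place/X/@38.2368624,-122.6353394,17z"
print(parse_lat_lng(sample))
```

Returning floats instead of a joined string makes range checks and later mapping easier, at the cost of a different column type than the `Coordinates` column above.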

In [9]:
df.head()

Unnamed: 0,Project Name,Address,APN,Applicant Name,Project Description,Posted Date,URL,Coordinates
0,Sepaher 4-unit Residential Building,315 Lakeville Street [,0007-154-013,"Architect Nicholas Lee, on behalf of property owner Nancy Sepaher",Proposed modification of the 4-unit residential building previously approved for the vacant lot ...,"Posted on December 21, 2022",https://cityofpetaluma.org/sepaher-4-unit-residential-building/,"38.2368624,-122.6353394"
1,MacDuff Work/Live Unit,307 Bodega Avenue [,006-301-025,"Brent Russell, Architect, Studio 202","The application proposes construction of a 2-story, 2,160 square foot structure to be used as on...","Posted on January 4, 2023",https://cityofpetaluma.org/macduff-work-live-unit/,
2,The Floodway LLC Appeal,4875 Petaluma Blvd N/4951 Stony Point Road [,007-422-028,Heather Kratt (H&H Real Estate LLC),Heather Kratt (H&H Real Estate LLC) has filed an applicant appeal to Planning Commission of a de...,"Posted on May 15, 2023",https://cityofpetaluma.org/the-floodway-llc-appeal/,"38.2685807,-122.6733413"
3,Marin Health Petaluma – Clinical Hub,335 South McDowell Blvd [,007-280-046,"Patrick McGaughey, Paragon Commercial Group","The projects propose to establish a Medical Services – Minor use within the existing 13,760 squa...","Posted on May 22, 2023",https://cityofpetaluma.org/marin-health-petaluma-clinical-hub/,"38.247621,-122.6248025"
4,Spring Hill School Conditional Use Permit Modification,705 North Webster Street [,"006-371-039 and -002, 006-441-020","Seth Nobmann, Advanced Building Solutions",The proposed projectrequests modificationofthe existingvested Spring Hill SchoolConditional Use ...,"Posted on July 17, 2023",https://cityofpetaluma.org/spring-hill-school-conditional-use-permit-modification/,"38.2309758,-122.6583301"


### Rohnert Park City

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}

url = 'https://www.rpcity.org/city_hall/departments/development_services/engineering/projects_in_progress'

# Get the webpage content
response = requests.get(url, headers=headers)
webpage = response.text

# Create BeautifulSoup object
soup = BeautifulSoup(webpage, 'lxml')

# Extract Title
all_h3_tags = soup.find_all('h3')
titles = [tag.text.strip() for tag in all_h3_tags[2:]]

# Extract Description
description_tags = soup.find_all('div', class_='accordion-content')
descriptions = [tag.get_text(strip=True) for tag in description_tags]

# Extract Location
location_strong_tags = soup.find_all('strong', string=re.compile(r'Location:', re.IGNORECASE))
locations = []
for strong_tag in location_strong_tags:
    next_sibling = strong_tag.next_sibling
    while next_sibling and (next_sibling.name == 'strong' or next_sibling.strip() == '&nbsp;' or not next_sibling.strip()):
        next_sibling = next_sibling.next_sibling
    if next_sibling:
        locations.append(next_sibling.strip())

# Extract Summary
summary_strong_tags = soup.find_all('strong', string=re.compile(r'Summary:', re.IGNORECASE))
summaries = []
for strong_tag in summary_strong_tags:
    next_sibling = strong_tag.next_sibling
    while next_sibling and (next_sibling.name == 'strong' or next_sibling.strip() == '&nbsp;' or not next_sibling.strip()):
        next_sibling = next_sibling.next_sibling
    if next_sibling:
        summaries.append(next_sibling.strip())

# Extract Project Status
project_status_strong_tags = soup.find_all('strong', string=re.compile(r'PROJECT STATUS', re.IGNORECASE))
project_statuses = []
for strong_tag in project_status_strong_tags:
    next_sibling = strong_tag.next_sibling
    while next_sibling and (next_sibling.name == 'strong' or next_sibling.strip() == '&nbsp;' or not next_sibling.strip()):
        next_sibling = next_sibling.next_sibling
    if next_sibling:
        if next_sibling.name == 'br':
            next_sibling = next_sibling.next_sibling
        if next_sibling and next_sibling.strip():
            project_status_text = next_sibling.strip()
            project_statuses.append(project_status_text)
        else:
            next_sibling = next_sibling.next_sibling
            while next_sibling and not next_sibling.strip():
                next_sibling = next_sibling.next_sibling
            if next_sibling and next_sibling.strip():
                project_status_text = next_sibling.strip()
                project_statuses.append(project_status_text)

# Ensure all lists have the same length
min_length = min(len(titles), len(descriptions), len(locations), len(summaries), len(project_statuses))
titles = titles[:min_length]
descriptions = descriptions[:min_length]
locations = locations[:min_length]
summaries = summaries[:min_length]
project_statuses = project_statuses[:min_length]

# Create DataFrame
df = pd.DataFrame({
    'Title': titles,
    'Description': descriptions,
    'Location': locations,
    'Summary': summaries,
    'Project_Status': project_statuses
})
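
The three label loops above (Location, Summary, Project Status) repeat the same sibling walk, so they could share one helper. The following is a sketch under assumptions, not the notebook's original code: the helper name `text_after_label` and the sample HTML are hypothetical. It skips adjacent `<strong>` fragments like the loops above, treats non-breaking spaces as whitespace, and records `None` when a label has no value so the result lists stay aligned.

```python
import re
from bs4 import BeautifulSoup, Tag

def text_after_label(soup, label):
    """Return the first non-empty text following each <strong> that contains label."""
    pattern = re.compile(re.escape(label), re.IGNORECASE)
    values = []
    for strong in soup.find_all("strong", string=pattern):
        value = None
        for sibling in strong.next_siblings:
            if isinstance(sibling, Tag):
                if sibling.name == "strong":
                    continue  # skip further bold fragments of the label
                text = sibling.get_text(" ", strip=True)
            else:
                text = str(sibling)
            # Normalise non-breaking spaces, then trim whitespace and stray colons.
            text = text.replace("\xa0", " ").strip(" :\t\r\n")
            if text:
                value = text
                break
        values.append(value)  # None keeps one entry per label occurrence
    return values

# Hypothetical markup resembling the accordion content.
sample_html = (
    "<p><strong>Location:</strong>&nbsp;<br/> L-Section in Rohnert Park.</p>"
    "<p><strong>PROJECT STATUS</strong>: Spring 2021</p>"
    "<p><strong>Summary:</strong></p>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
print(text_after_label(sample_soup, "Location"))
print(text_after_label(sample_soup, "project status"))
```

Because every label occurrence yields exactly one entry, truncating the lists to a common minimum length would be needed less often.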

In [5]:
df.head()

Unnamed: 0,Title,Description,Location,Summary,Project_Status
0,"Various Streets Preventative Maintenance, Proj...",Project also known as L-Section Rehabilitation...,L-Section in Rohnert Park.,The work is described generally as asphalt con...,: Spring 2021
1,Highway 101 Bike and Pedestrian Crossing Feasi...,Project funded by Measure M.Location:The feasi...,The feasibility study will study existing and ...,The US 101 freeway is a major barrier to east-...,: Feasibility study still in preliminary data ...
2,"State Farm Drive Rehabilitation, Project No. 2...",Project funded by Measure M and SB-1.Location:...,State Farm Drive from Rohnert Park Expressway ...,The work is described generally as Alternate A...,: Under construction
3,"2019 Pavement Maintenance Project, Project No....",Project also known as J & G Section Pavement R...,J & G Sections,The work is described generally as asphalt con...,": Complete as of April 3, 2020."
4,"Roundabout - Southwest Blvd/Commerce Blvd, Pro...","Project funded by Public Facilities, Fees, the...",Intersection of Southwest Boulevard and Commer...,This work is to address roadway safety by repl...,Construction phase


### Santa Maria City

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}

# Make a request to the URL with defined headers
response = requests.get('https://www.santamariagroup.com/projects', headers=headers)

# Extract text from the response
webpage = response.text

# Parse the webpage with BeautifulSoup
soup = BeautifulSoup(webpage, 'html.parser')

# Initialize lists to store data
titles = []
descriptions = []
locations = []
statuses = []
clients = []
roles = []

# Extract titles from <h3> tags
for h3 in soup.find_all('h3'):
    titles.append(h3.text.strip())

# Find all <p> tags
paragraphs = soup.find_all('p')

# Initialize variables to store temporary values
description = location = status = client = role = None

# Iterate through each <p> tag
for p in paragraphs:
    # Extract text content of the <p> tag and strip whitespace
    text = p.text.strip()
    
    # Check if the text starts with "Location:", "Status:", "Client:", or "SMG Role:"
    if text.startswith("Location:"):
        # If so, update the corresponding variable (split only on the first colon)
        location = text.split(":", 1)[1].strip()
    elif text.startswith("Status:"):
        status = text.split(":", 1)[1].strip()
    elif text.startswith("Client:"):
        client = text.split(":", 1)[1].strip()
    elif text.startswith("SMG Role:"):
        role = text.split(":", 1)[1].strip()
    else:
        # If none of the above conditions are met, assume it's a description
        if description:
            # If description already exists, append it to the list along with other details
            descriptions.append(description)
            locations.append(location)
            statuses.append(status)
            clients.append(client)
            roles.append(role)
        # Update description variable with current text
        description = text

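# Append the last project together with its details
if description:
    descriptions.append(description)
    locations.append(location)
    statuses.append(status)
    clients.append(client)
    roles.append(role)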

# Ensure all lists have the same length
max_length = max(len(titles), len(descriptions), len(locations), len(statuses), len(clients), len(roles))
titles += [''] * (max_length - len(titles))
descriptions += [''] * (max_length - len(descriptions))
locations += [''] * (max_length - len(locations))
statuses += [''] * (max_length - len(statuses))
clients += [''] * (max_length - len(clients))
roles += [''] * (max_length - len(roles))

# Create a DataFrame
df = pd.DataFrame({
    "Title": titles,
    "Description": descriptions,
    "Location": locations,
    "Status": statuses,
    "Client": clients,
    "SMG Role": roles
})
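
The paragraph loop above keeps one running value per field, so a project that lacks a field can inherit the previous project's value. One way to avoid this is to build one dictionary per project. The sketch below is an illustration under assumptions, not the notebook's code: the function name `group_records` is hypothetical, and it works on plain paragraph strings (for example `[p.text for p in paragraphs]`), assuming each description paragraph precedes its labelled lines.

```python
# Labels recognised in the paragraphs, mapped to output field names.
LABELS = {"Location": "location", "Status": "status",
          "Client": "client", "SMG Role": "role"}

def group_records(paragraph_texts):
    """Group 'description, then Label: value lines' into one dict per project."""
    records = []
    current = None
    for text in paragraph_texts:
        text = text.strip()
        if not text:
            continue
        label, sep, value = text.partition(":")  # split on the first colon only
        field = LABELS.get(label.strip()) if sep else None
        if field is not None:
            if current is not None:
                current[field] = value.strip()
        else:
            # Any other paragraph starts a new project with empty fields.
            current = {"description": text, "location": None,
                       "status": None, "client": None, "role": None}
            records.append(current)
    return records

# Sample paragraph texts in the assumed order.
sample_paragraphs = [
    "127 Units Preservation Project",
    "Location: Los Angeles, California",
    "Status: Completed",
    "Services: SMG Expedited The Process",
    "Client: Alta Public Schools",
]
for record in group_records(sample_paragraphs):
    print(record)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame`, which removes the need to pad separate lists to a common length.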


In [7]:
df.head()

Unnamed: 0,Title,Description,Location,Status,Client,SMG Role
0,All Nations Church,Conditional Use And Land Use Variance For A 25...,"Tujunga, California",2014 Approved Condition Compliance And Approve...,All Nations Church,Land Use Consultant And Owner’s Representative...
1,Alta Public Schools,Services: SMG Expedited The Process Of Going ...,"Tujunga, California",Completed,Alta Public Schools,Successfully Expedited The Process Of Aiding A...
2,Columbia Apartments,127 Units Preservation Project,Scattered Sites Located In Council Districts 1...,Financing Received 4% Tax Credits And Bonds 2012,Intercontinental Affordable Housing,Represented Client On 4% Financing And TEFRA A...
3,Doheny Eye Institute,Variance To Receive A Parking Reduction For Th...,"Los Angeles, California",Approved And Completed January 2013,Doheny Eye Institute,Lead Lobbyist And Land Use Consultant For The ...
4,Emerson College - West Coast Campus,Development Of West Coast Campus On Sunset Bou...,Hollywood California,Building Completed July 2013,Emerson College,Representation Of Client With The City Of LA F...


### Fontana City

In [11]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the headers for the HTTP request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

# List of URLs to scrape
urls = [
    'https://www.fontanaca.gov/3417/Commercial-Use-Development-Projects',
    'https://www.fontanaca.gov/3416/Civic-Use-Development-Projects',
    'https://www.fontanaca.gov/3418/Residential-Developments'
]

# Create lists to hold the scraped data
data = []

# Function to scrape a single URL
def scrape_url(url):
    webpage = requests.get(url, headers=headers).text
    soup = BeautifulSoup(webpage, 'lxml')
    
    for project in soup.find_all('h3', class_="subhead2"):
        title = project.text.strip()
        ul = project.find_next_sibling('ul')
        
        if ul:
            li_elements = ul.find_all('li')
            
            # Extract location
            if len(li_elements) > 0:
                location_element = li_elements[0]
                if location_element.find('span'):
                    location = location_element.find('span').text.strip()
                else:
                    location = location_element.text.split(":", 1)[1].strip() if ":" in location_element.text else "N/A"
            else:
                location = "N/A"
            
            # Extract expected timeline
            if len(li_elements) > 1:
                timeline_element = li_elements[1]
                timeline = timeline_element.text.split(":", 1)[1].strip() if ":" in timeline_element.text else "N/A"
            else:
                timeline = "N/A"
            
            # Extract description
            if len(li_elements) > 2:
                description_element = li_elements[2]
                description = description_element.text.split(":", 1)[1].strip() if ":" in description_element.text else "N/A"
            else:
                description = "N/A"
            
            # Append the data to the list with the URL
            data.append({
                'Title': title,
                'URL': url,
                'Location': location,
                'Expected Timeline': timeline,
                'Description': description
            })

# Loop through the URLs and scrape each one
for url in urls:
    scrape_url(url)

# Create a DataFrame from the list of data with the specified column order
df = pd.DataFrame(data, columns=['Title', 'URL', 'Location', 'Expected Timeline', 'Description'])


In [13]:
df.head()

Unnamed: 0,Title,URL,Location,Expected Timeline,Description
0,"Chrysler, Dodge, Ram, and Jeep",https://www.fontanaca.gov/3417/Commercial-Use-...,South Highland Ave. and Oleander Ave. in the S...,Winter 2022 Completion,"The new Chrysler, Dodge, and Jeep dealership w..."
1,Fontana Hyundai,https://www.fontanaca.gov/3417/Commercial-Use-...,"16850 S Highland Ave, Fontana, CA 92336",Summer 2021,
2,Northgate Market Shopping Center,https://www.fontanaca.gov/3417/Commercial-Use-...,Sierra Ave and San Bernardino Ave. in the Nort...,Construction Starts Winter 2022,A new shopping center has been approved for Ce...
3,La Quinta Hotel,https://www.fontanaca.gov/3417/Commercial-Use-...,Juniper Ave. and Slover Ave. in the Southeast ...,Currently Under Construction,The new hotel will consist of four stories and...
4,Marriott Towne Place Suites,https://www.fontanaca.gov/3417/Commercial-Use-...,10530 Sierra Ave.,Construction Starts Winter 2022,Fontana will soon be home to Marriott Towne Pl...


### Thousand Oaks City

In [8]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_project_details(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    
    data = {}

    # Scrape the h1 tag with the specified class for the title
    title_tag = soup.find('h1', class_='titlewidget-title')
    if title_tag:
        data['Title'] = title_tag.text.strip()

    # List of relevant sections to include
    include_sections = {
        "Project Description": ["h3", "strong"],
        "Project Timeline": ["h3", "strong"],
        "Project Timeframe": ["h3", "strong"],  # Adding Project Timeframe
        "Budget/Funding": ["h3", "strong"],
        "Budget": ["h3", "strong"],  # Adding Budget
        "Project Manager Contact Information:": ["h3", "strong"],
        "Project Contacts:": ["h3", "strong"]  # Adding Project Contacts
    }

    # Function to extract text from tags
    def extract_text(tag):
        if tag.name == 'ul':
            return ' '.join(li.text.strip() for li in tag.find_all('li'))
        return tag.text.strip()

    # Extract sections
    for section, tags in include_sections.items():
        for tag in tags:
            # Search for tags that contain the section text
            section_tag = soup.find(tag, string=lambda text: text and section.lower() in text.lower())
            if section_tag:
                next_tag = section_tag.find_next_sibling()
                content = []
                while next_tag and next_tag.name not in ['h3', 'strong']:
                    content.append(extract_text(next_tag))
                    next_tag = next_tag.find_next_sibling()
                if content:
                    # Use the common key "Project Timeline" for both "Project Timeline" and "Project Timeframe"
                    if section in ["Project Timeline", "Project Timeframe"]:
                        data["Project Timeline"] = ' '.join(content).strip()
                    # Use the common key "Budget/Funding" for both "Budget" and "Budget/Funding"
                    elif section in ["Budget", "Budget/Funding"]:
                        data["Budget/Funding"] = ' '.join(content).strip()
                    # Use the common key "Project Manager Contact Information:" for both variations
                    elif section in ["Project Manager Contact Information:", "Project Contacts:"]:
                        data["Project Manager Contact Information:"] = ' '.join(content).strip()
                    else:
                        data[section] = ' '.join(content).strip()
                break

    # Handle case where content is in <ul> and <li> tags under a <p><strong>...</strong></p>
    for section, tags in include_sections.items():
        for tag in tags:
            section_tag = soup.find(tag, string=lambda text: text and section.lower() in text.lower())
            if section_tag:
                # Check if next sibling is <ul>
                next_tag = section_tag.find_parent().find_next_sibling()
                content = []
                while next_tag and next_tag.name not in ['h3', 'strong', 'p']:
                    if next_tag.name == 'ul':
                        content.append(extract_text(next_tag))
                    next_tag = next_tag.find_next_sibling()
                if content:
                    # Use the common key "Project Timeline" for both "Project Timeline" and "Project Timeframe"
                    if section in ["Project Timeline", "Project Timeframe"]:
                        data["Project Timeline"] = ' '.join(content).strip()
                    # Use the common key "Budget/Funding" for both "Budget" and "Budget/Funding"
                    elif section in ["Budget", "Budget/Funding"]:
                        data["Budget/Funding"] = ' '.join(content).strip()
                    # Use the common key "Project Manager Contact Information:" for both variations
                    elif section in ["Project Manager Contact Information:", "Project Contacts:"]:
                        data["Project Manager Contact Information:"] = ' '.join(content).strip()
                    else:
                        data[section] = ' '.join(content).strip()
                break

    return data

# Correct URL should be specified here
base_url = 'https://www.toaks.org/departments/public-works/construction'  

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'
}
# Get the initial page to extract project URLs
response = requests.get(base_url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

# Extract project URLs
project_urls = []
for h3 in soup.find_all('h3'):
    a_tag = h3.find('a')
    if a_tag:
        project_urls.append(a_tag['href'])

# Initialize an empty list to store data
data_list = []

# Iterate over project URLs to scrape project information
for url in project_urls:
    project_data = scrape_project_details(url)
    if project_data:
        data_list.append(project_data)

# Convert the data list to a pandas DataFrame
df = pd.DataFrame(data_list)
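
The loop above passes each `href` straight to `requests.get`, which only works when the links are absolute. If the listing page ever uses relative links, they could be resolved against `base_url` first. This is a small sketch of that idea; the helper name `absolutize` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

def absolutize(base_url, hrefs):
    """Resolve each href against base_url and drop duplicates, keeping order."""
    seen = set()
    resolved = []
    for href in hrefs:
        full = urljoin(base_url, href)
        if full not in seen:
            seen.add(full)
            resolved.append(full)
    return resolved

base = "https://www.toaks.org/departments/public-works/construction"
# Hypothetical hrefs: a root-relative link, its absolute form, and another link.
sample_hrefs = [
    "/departments/public-works/construction/project-a",
    "https://www.toaks.org/departments/public-works/construction/project-a",
    "/departments/public-works/construction/project-b",
]
print(absolutize(base, sample_hrefs))
```

Absolute links pass through `urljoin` unchanged, so the helper is safe to apply whether or not the site uses relative paths.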


In [10]:
df.head()

Unnamed: 0,Title,Project Description,Project Timeline,Budget/Funding,Project Manager Contact Information:
0,Conejo Canyons Bridge,The project includes construction of a steel b...,The project construction started in September ...,"The total estimated project cost is $3,200,000...","Saeed Zolfaghari, PE Associate Engineer szolfa..."
1,Electric Vehicle Charging Station Project,"The work includes site preparation, electrical...",,Construction contract has been awarded to Pref...,
2,CI 5435- Erbes Road Fence Replacement Project,"The project includes replacement of 3,500 ft e...",The project is currently in the final design p...,"The total estimated project cost is $330,000 a...",
3,MI 2574 Guard Rail Repairs Project,,,,
4,Hillcrest Drive Bike Lane and Pedestrian Impro...,,The project is currently in the preliminary de...,"The total project budget is $2,890,000. The cu...",


### Arcata City

In [14]:
# Dependencies for the Arcata scraper
%pip install requests beautifulsoup4 selenium tabulate

Note: you may need to restart the kernel to use updated packages.


In [18]:
import pandas as pd
from IPython.display import display, HTML

In [19]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.cityofarcata.org/421/Completed-Projects"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

data = []
table = soup.find("table")
rows = table.find_all("tr")

for row in rows:
    cols = row.find_all("td")
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

df = pd.DataFrame(data, columns=["City Project Title", "Project Description", "Project Status"])


In [23]:
html = df.head().to_html(index=False)
display(HTML(html))

City Project Title,Project Description,Project Status
,,
8th & 9th Streets Improvement Project,"The City of Arcata contracted with RAO Construction for the 8th & 9th Streets Improvement Project. Work on 8th & 9th Streets took place between I Street & K Street and involved: (1) removing and installing new ADA compliant sidewalks and (2) street improvements, which included crosswalks, intersections, parking, bike lanes, and drainage. Additionally, to further remind, street improvements included updates on paths of travel. 9th street became a one-way vehicle travel to K Street and 8th Street became a one-way vehicle travel from K Street to F Street.",Complete
Arcata South I Street Boat Launch Facility Improvement Project,"This project involved the installation of a boat launch, concrete abutment, minor concrete work, pavement, drainage, and accessibility improvements.The project took place at the end of South I Street in the Arcata Marsh Wildlife Sanctuary, at the edge of Humboldt Bay.Project Plans can be found here.","CompleteIf there are any issues or deficiencies observed by the public, please contact Jak Kirchubel, Engineering Technician I; Office: (707)-825-2174 or via email: jkirchubel@cityofarcata.org"
2023 Arcata Annual Paving Project,"This project involved grinding and replacing existing asphalt; sidewalk infill and replacement; accessible ramp construction; AC curb construction; drainage improvements; curb and gutter construction; wedge and roadway conform grinding; asphalt concrete overlay; raise manhole, valve, cleanout and monument covers to grade; installation of truncated domes; and roadway striping and legends. The project included grinding asphalt, paving, curbs, sidewalk, compaction, pavement markings/stripings and signage.This project occurred along H Street between Sunset Avenue and 11th Street, G Street between 5th and 7th Street, and West End Road between Giuntoli Lane and the City Limits.Project Plans can be found here.","CompleteIf there are any issues or deficiencies that are observed by the public, please contact Jak Kirchubel, Engineering Technician I; Office: (707)-825-2174 or via email: jkirchubel@cityofarcata.org"
Fiber Installation Project,"Lightwave Construction is working in portions of Arcata to construct new conduit and fiber infrastructure for Vero Networks. This system will enable broadband connectivity to high-bandwidth customers in the Arcata/Eureka area such as large enterprise, education, government & carrier/last mile providers.As a state-recognized public utility, Vero Networks will be performing work in the City right-of-way under a permit issued by the City of Arcata. The City has imposed certain requirements to lessen the impact of this work in Arcata’s neighborhoods, which can be found in more detail here.Project plans can be viewed for Phase I & Phase II of the project.",Complete


### Daly City

In [24]:
# Dependency for the Daly City PDF tables
%pip install tabula-py

Collecting tabula-py
  Downloading tabula_py-2.9.3-py3-none-any.whl.metadata (7.6 kB)
Downloading tabula_py-2.9.3-py3-none-any.whl (12.0 MB)
Installing collected packages: tabula-py
Successfully installed tabula-py-2.9.3
Note: you may need to restart the kernel to use updated packages.


In [25]:
import pandas as pd
import tabula

# PDF link
pdf_url = "https://www.dalycity.org/DocumentCenter/View/7309/Current-Projects-List---Updated-January-1-2023-PDF?bidId="

# Read PDF and extract tables
tables = tabula.read_pdf(pdf_url, pages='all', multiple_tables=True)

# Combine tables into a single DataFrame
df = pd.concat(tables)
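
Tables extracted from a PDF often carry carriage returns inside header names, and the same header can appear in two spellings across pages (for example `Site Size\r(acres)` and `Site Size (acres)`). A possible cleanup step is sketched below; the function name `clean_header` is hypothetical and the step is not part of the original notebook.

```python
def clean_header(name):
    """Collapse carriage returns, newlines, and repeated spaces in a column name."""
    return " ".join(str(name).split())

# Header spellings of the kind that PDF extraction can produce.
raw_columns = ["Proposed\rUnit Type", "Site Size\r(acres)", "Site Size (acres)"]
print([clean_header(column) for column in raw_columns])
```

With pandas, the function could be applied through `df.rename(columns=clean_header)`. Columns that become identical after cleaning would still need to be merged, and headers that lost a space entirely (such as `ProposedDensity (du/ac)`) are not repaired by this step.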


In [26]:
df.head()

Unnamed: 0,Project Name,Application Type,Plan Case No.,Location/APN,Proposed\rUnit Type,Site Size\r(acres),Proposed\rUnits Qty.,Proposed\rDensity (du/ac),Status,Intake Date,...,Site Size (acres),Unnamed: 1,Proposed Units Qty.,Unnamed: 2,ProposedDensity (du/ac),Unnamed: 3,PlanningContact,Proposed\rNew Sq. Ft.,ProposedType,Planner
0,Point Martin\r- Phase One,Major subdivision,SUB-6-15-11016,Steve Courter Way and Martin\rStreet;\rAPN 005...,Detached\rhomes,1.9,16,8.0,Under construction,"June 8, 2015",...,,,,,,,,,,
1,Point Martin\r- Phase Two,General Plan\rAmendment and\rPD Zone Change,GPA-01-16-011884\rPD-10-15-11781\rCEQA-10-15-1...,APNs 005-031-070 + 47 others;\rSteve Courter W...,Detached\rhomes,8.3,117,14.1,Under construction,"October 29, 2015",...,,,,,,,,,,
2,Pacific Place Retail\rConversion,Zone Change\rMajor Subdivision\rDesign Review\...,ZC-4-16-12036\rSUB-4-16-12037\rUP-4-16-12038\r...,2665 Geneva Avenue\rAPN 005-064-250,Condominiums,1.0,7,7.0,All entitlements\rappoved,"April 14, 2016",...,,,,,,,,,,
3,Mission Street/\rGoethe Street\rMixed-Use Buil...,Use Permit and\rDesign Review,SUB-12-20-14799\rUPR-9-16-12301\rDR-9-16-012302,6098 Mission Street\rAPNs 004-031-160 and 170,Apartment\rbuilding,0.25,36,144.0,All entitlements\rappoved,"September 13, 2016",...,,,,,,,,,,
4,Eastmoor\rMixed-Use\r(CORE),Use Permit and\rDesign Review,UPR-6-19-14076 and\rDR-6-19-14077,493 Eastmoor Avenue\rAPN 008-082-200,Mixed-Use\rApartment\rbuilding,0.37,71,192.9,In plan check,"June 18, 2019",...,,,,,,,,,,


## Automation and Continuous Updating

To automate scraping and standardization, the scripts can run on a schedule (for example, as cron jobs) so that the data is refreshed at regular intervals. Standardization steps are applied on every run to keep the output consistent across sources.
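
Each scraper above produces its own column names, so a standardization step might map them onto one shared schema. The sketch below illustrates the idea with plain dictionaries. The shared field names, the `COLUMN_MAPS` table, and the function name `standardize` are assumptions made for illustration; the column names on the left are taken from the DataFrames above.

```python
# Per-source mapping from scraped column names to shared field names.
# The shared names on the right are illustrative choices.
COLUMN_MAPS = {
    "petaluma": {"Project Name": "title", "Project Description": "description",
                 "Address": "location"},
    "rohnert_park": {"Title": "title", "Description": "description",
                     "Location": "location", "Project_Status": "status"},
    "fontana": {"Title": "title", "Description": "description",
                "Location": "location"},
}
SCHEMA = ("source", "title", "description", "location", "status")

def standardize(source, record):
    """Map one scraped row (a dict) onto the shared schema."""
    mapping = COLUMN_MAPS[source]
    out = dict.fromkeys(SCHEMA)  # every field starts as None
    out["source"] = source
    for column, value in record.items():
        field = mapping.get(column)
        if field is None:
            continue  # column is not part of the shared schema
        if isinstance(value, str):
            value = " ".join(value.split())  # collapse stray whitespace
            if value == "":
                value = None
        out[field] = value
    return out

row = {"Title": " L-Section  Rehab ", "Project_Status": "", "Summary": "ignored"}
print(standardize("rohnert_park", row))
```

Rows from a DataFrame can be fed in with `df.to_dict("records")`, and the standardized dictionaries can be collected back into a single DataFrame covering all cities.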

**Continuous Data Updates**
To keep the data sources current, we can use cron, the time-based job scheduler on Unix-like operating systems. Cron runs scripts at specified times and intervals, so the scraping scripts fetch the latest data from the target websites without manual intervention.

**Scheduling with Cron Jobs**
With cron, the scraping scripts can be scheduled to match how often each source changes. For example, if a website is updated daily, the cron job can run the scraping script once a day.

In [None]:
#Here is an example of how to set up a cron job for a Python scraping script:

#1. Create the Scraping Script: Assume we have a script named scrape_data.py 
#    that performs the web scraping and data standardization tasks.

In [None]:
# scrape_data.py

import json

import requests
from bs4 import BeautifulSoup

def scrape_and_standardize():
    url = 'https://example.com/projects'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract and standardize data
    projects = []
    for project in soup.find_all('div', class_='project'):
        title = project.find('h2').text.strip()
        description = project.find('p').text.strip()
        projects.append({'title': title, 'description': description})

    # Save standardized data to a file or database
    with open('projects.json', 'w') as f:
        json.dump(projects, f, indent=4)

if __name__ == "__main__":
    scrape_and_standardize()


In [None]:
# 2. Set Up the Cron Job: Open the crontab configuration by running crontab -e in the 
# terminal and add the following line to schedule the script to run daily at midnight:

In [None]:
0 0 * * * /usr/bin/python3 /path/to/scrape_data.py

This cron job entry means the scrape_data.py script will be executed every day at 00:00 (midnight).

**Production Environment Considerations**
In a production environment, several additional considerations ensure the reliability and scalability of the automation system:

* Error Handling and Logging: Implement robust error handling and logging within the scraping script to monitor the success and failure of each run. This helps in diagnosing issues and maintaining system health.

* Notifications: Set up notifications (e.g., email alerts) to inform the system administrators of any critical errors or important events.

* Version Control: Use version control systems (e.g., Git) to manage the codebase, enabling collaborative development and change tracking.

* Scalability: Deploy the scraping scripts on scalable infrastructure (e.g., cloud services) to handle varying loads and ensure high availability.

* Data Storage: Store the scraped data in a centralized, reliable database (e.g., PostgreSQL, MongoDB) that supports querying and analysis. Implement regular backups to prevent data loss.
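
The error handling and logging point above could look roughly like the following wrapper around a scheduled run. This is a minimal sketch under assumptions: the function name `run_with_retries`, the logger name, and the retry defaults are hypothetical and would need tuning for a real deployment.

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("scraper")

def run_with_retries(task, attempts=3, delay_seconds=5.0):
    """Call task() up to `attempts` times; log each failure and re-raise the last one."""
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            logger.info("run succeeded on attempt %d", attempt)
            return result
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("run failed on attempt %d of %d", attempt, attempts)
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)
```

In the cron script, the entry point could then call `run_with_retries(scrape_and_standardize)`. Because the final failure is re-raised, the process exits with a non-zero status, which a notification mechanism can pick up.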

By automating the data scraping and standardization processes with cron jobs and adhering to production standards, we can ensure the continuous and reliable updating of data sources, maintaining high data quality and consistency.

## Conclusion

* In my web scraping project focused on construction and infrastructure projects and tenders in California, I encountered several challenges and implemented corresponding solutions to address them.

* Understanding the different HTML structures of each website was a significant challenge. Each site had a unique layout, which made it difficult to extract data consistently. To overcome this, I invested time in thoroughly analyzing the HTML of each site and developed custom parsers to handle the variations.

* Maintaining data quality and consistency was another major issue. Ensuring that the scraped data was accurate and consistent across different sources required significant effort. I implemented data validation checks and used data cleaning techniques to standardize and verify the quality of the extracted information.

* Access restrictions on some websites posed additional difficulties. Certain sites did not permit automated scraping, which made it hard to obtain the necessary data. To address this, I explored alternatives such as headless browsers and rotating IP addresses, while keeping access within ethical and legal boundaries.

* Furthermore, some websites published poorly formatted data, which complicated extraction. To deal with this, I used text processing and natural language processing (NLP) techniques to clean and extract the relevant information from these sources.

* By addressing these challenges with targeted solutions, I was able to effectively scrape and compile valuable data on construction and infrastructure projects and tenders in California.
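
The notebook cells above do not show the validation checks mentioned in the conclusion. As an illustration only, a per-record check might look like the sketch below. The function name `validate_record` and the specific rules are assumptions; the field names (`Title`, `Project Name`, `URL`, `Coordinates`) come from the DataFrames above.

```python
from urllib.parse import urlparse

def validate_record(record):
    """Return a list of problems found in one scraped row; empty means it passed."""
    problems = []

    # A record needs a non-empty title under either column name.
    title = (record.get("Title") or record.get("Project Name") or "").strip()
    if not title:
        problems.append("missing title")

    # If a URL is present, it must be an absolute http(s) URL.
    url = record.get("URL")
    if url is not None:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("malformed URL")

    # If coordinates are present, they must be "lat,lng" within valid ranges.
    coords = record.get("Coordinates")
    if coords:
        try:
            lat, lng = (float(part) for part in coords.split(","))
        except ValueError:
            problems.append("unparseable coordinates")
        else:
            if not (-90 <= lat <= 90 and -180 <= lng <= 180):
                problems.append("coordinates out of range")
    return problems

good = {"Project Name": "MacDuff Work/Live Unit",
        "URL": "https://cityofpetaluma.org/macduff-work-live-unit/",
        "Coordinates": "38.2368624,-122.6353394"}
print(validate_record(good))
```

Collecting problems as a list, rather than raising on the first one, lets a run report every issue in a row before deciding whether to keep it.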