# Scraping
### JP Maestas
### OCT 26, 2025

The main purpose of this notebook is to gather the datasets for our analysis. I will mainly use the [Pandas](https://pandas.pydata.org/docs/user_guide/index.html) package in Python, along with some other libraries for statistical testing. You may need to install some software to run the code, but a nice thing about IPYNB files is that they let you see the output without running anything!

In [None]:
import pandas as pd
from bs4 import BeautifulSoup # HTML parsing package (useful for extracting data from page source)
import requests            # HTTP request package (useful for downloading pages and API responses)
import time 
import geopandas as gpd
from shapely import geometry

# UNHCR
[Refugee Statistics](https://www.unhcr.org/refugee-statistics)
# Global Human Settlement 
[1975 - 2030](https://human-settlement.emergency.copernicus.eu/ghs_pop2023.php)
# Global Historical Climatology Network daily
[Daily Climate Surveys](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
# 2024 Energy Usage Report
Study on 2024 data center energy usage from [Berkeley Lab](https://eta.lbl.gov/publications/2024-lbnl-data-center-energy-usage-report)
# US Data Center Locator 
[Geolocation of Data Centers](https://map.datacente.rs/)

For the first part of the project, I am going to scrape the geolocations of data centers nationwide and plot them with the geopandas package. I will also query the list of ZIP codes so we can analyze the impact on these regions in more detail.
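
As a rough sketch of how scraped locations could be prepared for plotting, the snippet below turns rows of names and coordinates into a GeoJSON FeatureCollection using only the standard library. The row layout (`name`, `lat`, `lon`), the helper names, and the sample coordinates are all assumptions made for illustration; they are not taken from the scraped data.

```python
import json


def to_feature(name: str, lat: float, lon: float) -> dict:
    """Build one GeoJSON Point feature. GeoJSON orders coordinates as [lon, lat]."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range for {name!r}: {lat}, {lon}")
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }


def to_feature_collection(rows: list[dict]) -> dict:
    """Wrap a list of {"name", "lat", "lon"} rows in a FeatureCollection."""
    features = [to_feature(r["name"], r["lat"], r["lon"]) for r in rows]
    return {"type": "FeatureCollection", "features": features}


# Placeholder rows: names and coordinates are invented for illustration.
rows = [
    {"name": "Example DC A", "lat": 47.76, "lon": -122.21},
    {"name": "Example DC B", "lat": 39.04, "lon": -77.49},
]
collection = to_feature_collection(rows)
print(json.dumps(collection["features"][0]["geometry"]))
```

With geopandas installed, the features could then be loaded with `gpd.GeoDataFrame.from_features(collection["features"], crs="EPSG:4326")` and drawn with `.plot()`.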


NOTE: Always read a site's [robots.txt file](https://www.datacentermap.com/robots.txt) before scraping so you don't violate its crawling rules or terms of service!
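
The robots.txt check can also be done in code. Below is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `is_allowed` and the sample rules are assumptions for illustration; the rules shown are not the real contents of any site's robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "JPMM Research_Scraper", "https://example.com/usa/"))
print(is_allowed(SAMPLE_ROBOTS, "JPMM Research_Scraper", "https://example.com/private/x"))
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of parsing a string.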

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

base_url = "https://www.datacentermap.com"
usa_url = f"{base_url}/usa/"
user_agent = "JPMM Research_Scraper"

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run in background
# Selenium ignores requests-style header dicts, so set the User-Agent as a Chrome flag
options.add_argument(f"--user-agent={user_agent}")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

# Load page
driver.get(usa_url)
time.sleep(3)  # Wait for JS to load

# Parse with BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Extract state links (relative hrefs such as /usa/<state>/ contain exactly three slashes)
state_links = []
for a in soup.select("div#content a"):
    href = a.get("href")
    if href and "/usa/" in href and href.count("/") == 3:
        state_links.append(base_url + href)

print(state_links)

[]
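
The empty list means no link passed both the selector and the filter. One way to narrow down the cause is to test the filtering logic offline against a small HTML snippet. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and it filters on the URL path rather than counting slashes, since an absolute href contains extra slashes and would fail the `== 3` check. The sample markup is invented and the real page structure may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.datacentermap.com"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def state_links(html: str, base_url: str = BASE_URL) -> list[str]:
    """Return absolute, de-duplicated URLs whose path looks like /usa/<state>/."""
    collector = LinkCollector()
    collector.feed(html)
    links: list[str] = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # handles relative and absolute hrefs
        parts = [p for p in urlparse(absolute).path.split("/") if p]
        if len(parts) == 2 and parts[0] == "usa" and absolute not in links:
            links.append(absolute)
    return links


# Invented markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<div id="content">
  <a href="/usa/">USA</a>
  <a href="/usa/texas/">Texas</a>
  <a href="https://www.datacentermap.com/usa/ohio/">Ohio</a>
  <a href="/usa/texas/dallas/">Dallas</a>
</div>
"""
print(state_links(SAMPLE_HTML))
```

If this logic returns links for a saved copy of `driver.page_source` but the notebook cell does not, the `div#content` selector is the more likely culprit.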


In [18]:

import requests
import pandas as pd
import json

url = "https://map.datacente.rs/api/dc-summary/0744b368-b62b-11e5-ad0b-02b4d6763261"
headers = {
    "User-Agent": "Mozilla/5.0 ( Intel Mac OS X 10_15_7)",
    "Accept": "application/json",
}

r = requests.get(url, headers=headers, timeout=30)
r.raise_for_status()  # Stop early on HTTP errors instead of parsing an error page
data = r.json()

company = data.get("company", {})
dc = data.get("dc", {})

record = {
    "datacenter_name": dc.get("Name"),
    "datacenter_city": dc.get("City") or company.get("city"),
    "datacenter_country": dc.get("Country") or company.get("country"),
    "datacenter_address": dc.get("Address") or company.get("address"),
    "datacenter_website": dc.get("Website") or company.get("website"),
    "datacenter_email": dc.get("Email") or company.get("email"),
    "datacenter_tel": dc.get("Tel") or company.get("tel"),
    "company_name": company.get("name"),
    "company_description": company.get("short_description"),
    "company_website": company.get("website"),
    "company_email": company.get("email"),
    "uuid": dc.get("uuid"),
}

# Convert to a one-row DataFrame and print it transposed for readability
df = pd.DataFrame([record])
print(df.T)


                                                                     0
datacenter_name                                         Kanobe Bothell
datacenter_city                                                Bothell
datacenter_country                                       United States
datacenter_address                  3301 Monte Villa Parkway Suite 125
datacenter_website               http://www.kanobe.com/colocation.html
datacenter_email                                      sales@kanobe.com
datacenter_tel                                            425.686.7700
company_name                                                    Kanobe
company_description  Since 2009, KANOBE has been providing Business...
company_website                                  http://www.kanobe.com
company_email                                         sales@kanobe.com
uuid                              0744b368-b62b-11e5-ad0b-02b4d6763261
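
To reuse the flattening step for more than one data center, the record-building logic could be pulled into a function. The sketch below assumes every response has the same `dc` and `company` shape as the one above. The names `build_record`, `DC_FIELDS`, and `COMPANY_FIELDS` are hypothetical, and the sample payload is invented so the example runs offline.

```python
# (output column, key in "dc", fallback key in "company" or None)
DC_FIELDS = [
    ("datacenter_name", "Name", None),
    ("datacenter_city", "City", "city"),
    ("datacenter_country", "Country", "country"),
    ("datacenter_address", "Address", "address"),
    ("datacenter_website", "Website", "website"),
    ("datacenter_email", "Email", "email"),
    ("datacenter_tel", "Tel", "tel"),
]

# (output column, key in "company")
COMPANY_FIELDS = [
    ("company_name", "name"),
    ("company_description", "short_description"),
    ("company_website", "website"),
    ("company_email", "email"),
]


def build_record(payload: dict) -> dict:
    """Flatten one API payload, falling back to company values when dc values are empty."""
    # "or {}" also covers payloads where the key exists but holds null.
    company = payload.get("company") or {}
    dc = payload.get("dc") or {}
    record = {}
    for column, dc_key, company_key in DC_FIELDS:
        value = dc.get(dc_key)
        if not value and company_key is not None:
            value = company.get(company_key)
        record[column] = value
    for column, key in COMPANY_FIELDS:
        record[column] = company.get(key)
    record["uuid"] = dc.get("uuid")
    return record


# Invented payload for illustration; it only mimics the key names used above.
sample_payload = {
    "dc": {"Name": "Example DC", "City": "", "uuid": "example-uuid"},
    "company": {"name": "Example Co", "city": "Example City"},
}
record = build_record(sample_payload)
print(record["datacenter_name"], "|", record["datacenter_city"])
```

A list of such records can be passed straight to `pd.DataFrame(records)` to get one row per data center.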


This scraper finds all the url requests and collects metadata about 