# About the Challenge:

Aligned with the United Nations Sustainable Development Goals and the EY Ripples program, the EY Open Science AI & Data Challenge is an annual competition that gives university students, early-career professionals and EY people the opportunity to develop data models using artificial intelligence (AI) and computing technology to create solutions that address critical climate issues, building a more sustainable future for society and the planet.

The 2025 AI & Data Challenge focuses on the urban heat island effect, which arises from the high density of buildings and the lack of green space and water bodies in urban areas. Temperature differences between rural and urban environments can exceed 10 degrees Celsius in some cases and cause significant health-, social- and energy-related issues. Those particularly vulnerable to heat-related problems include young children, older adults, outdoor workers, and low-income populations.

All output from the challenge can help bring cooling relief to vulnerable communities. Entrants with top scores will also take home cash prizes and receive an invitation to an awards celebration.

Problem Statement:

The goal of the challenge is to develop a machine learning model to predict heat island hotspots in an urban location. Additionally, the model should be designed to discern and highlight the key factors that contribute significantly to the development of these hotspots within city environments.

Participants will be given near-surface air temperature data in an index format, which was collected on 24 July 2021 using a ground traverse in the Bronx and Manhattan region of New York City. This dataset consists of traverse points (latitude and longitude) and their corresponding UHI (Urban Heat Island) Index values. Participants will use this dataset to build a regression model to predict UHI Index values for a given set of locations.
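
The modeling task described above can be sketched as follows. This is only an illustration, not the challenge's reference model: the data is synthetic, the feature names (`ndvi`, `building_density`, `noise`) are hypothetical, and a random forest is just one reasonable choice of regressor. R² is used here as one common regression metric, and the forest's impurity-based importances stand in for "key factors".

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins for per-location features (hypothetical names).
features = pd.DataFrame({
    "ndvi": rng.uniform(0.0, 0.8, n),
    "building_density": rng.uniform(0.0, 1.0, n),
    "noise": rng.normal(0.0, 1.0, n),  # deliberately unrelated to the target
})

# Synthetic target: cooler with vegetation, warmer with buildings.
uhi_index = (
    1.0
    - 0.05 * features["ndvi"]
    + 0.02 * features["building_density"]
    + rng.normal(0.0, 0.002, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    features, uhi_index, test_size=0.3, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score on held-out rows and rank features by importance.
r2 = r2_score(y_test, model.predict(X_test))
importances = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)

print(f"R^2 on held-out data: {r2:.3f}")
print(importances)
```

With real data, the synthetic `features` frame would be replaced by values extracted at each traverse point from the satellite, building and weather datasets.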

It is important to understand that the UHI Index at any given location is indicative of the relative temperature difference at that specific point when compared to the city's average temperature. This index serves as a crucial metric for assessing the intensity of heat within different urban zones.
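
To make the definition concrete, here is a small sketch under an assumption that the text above does not spell out: that the index is the ratio of a location's temperature to the mean temperature over all measured locations. Under that assumption, values above 1 mark warmer-than-average locations and values below 1 mark cooler ones, which is consistent with the sample values near 1.0 shown later in this notebook. The temperatures are made-up numbers.

```python
import numpy as np

# Hypothetical near-surface air temperatures (degrees Celsius) at four points.
temps_c = np.array([30.0, 31.5, 29.0, 33.5])

# Assumption: index = local temperature / mean temperature of all points.
city_mean = temps_c.mean()
uhi_index = temps_c / city_mean

for temp, index in zip(temps_c, uhi_index):
    label = "hotter" if index > 1 else "cooler"
    print(f"{temp:5.1f} C -> index {index:.4f} ({label} than average)")
```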

# Data Description

Target Dataset:

Near-surface air temperature data in an index format was collected on 24 July 2021 across the Bronx and Manhattan regions of New York City in the United States. The data was collected in the afternoon between 3:00 pm and 4:00 pm. This dataset includes time stamps, traverse points (latitude and longitude) and the corresponding Urban Heat Island (UHI) Index values for 11,229 data points. These UHI Index values are the target parameters for your model.

Please find the dataset here.

Feature Datasets:

Participants can draw on many datasets for their models. Model performance will depend on how well they identify the datasets and parameters that matter most. The following are the recommended satellite datasets:

- European Sentinel-2 optical satellite data
- NASA Landsat optical satellite data

These datasets can be extracted from Microsoft Planetary Computer Portal's data catalog. Please see the sample notebooks for more details.
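
One feature commonly derived from optical satellite bands is the Normalized Difference Vegetation Index (NDVI), which relates to the green-space factor mentioned earlier. The sketch below assumes the red and near-infrared reflectances have already been read into arrays (for Sentinel-2 these are bands B04 and B08). The small arrays are made-up values, and the challenge's sample notebooks may use different bands or indices.

```python
import numpy as np

def ndvi(nir, red):
    """Return (NIR - Red) / (NIR + Red), with NaN where both bands are zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    # Skip the division where the denominator is 0 (e.g. no-data pixels).
    return np.divide(
        nir - red, denom, out=np.full_like(denom, np.nan), where=denom != 0
    )

# Made-up 2x2 reflectance patches.
red_band = np.array([[0.10, 0.30], [0.05, 0.0]])
nir_band = np.array([[0.50, 0.30], [0.45, 0.0]])

ndvi_values = ndvi(nir_band, red_band)
print(ndvi_values)
```

Higher NDVI values indicate denser vegetation, so a per-point NDVI value is a natural candidate feature for the regression model.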

Additional Datasets:

Participants can also explore the following datasets in their model development journey:

- Building footprints of the Bronx and Manhattan regions
- Detailed local weather dataset of the Bronx and Manhattan regions on 24 July 2021

Additionally, participants may use other datasets for their models, provided those datasets are open and available to all public users and their sources are referenced in the model.

Validation Dataset:

After building the machine learning model, you need to predict the UHI index values on the locations identified in the validation dataset. Predictions on the validation dataset need to be saved in a CSV file and uploaded to the challenge platform to get a score on the ranking board.
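
The submission step can be sketched as below. The column names mirror the training file shown later in this notebook, but the exact layout the platform expects, the file name `submission.csv`, and the coordinates and predicted values here are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical validation locations; in practice these come from the
# validation dataset supplied with the challenge.
validation = pd.DataFrame({
    "Longitude": [-73.971665, -73.971928],
    "Latitude": [40.788763, 40.788875],
})

# Placeholder for model.predict(validation_features).
predictions = np.array([1.0012, 0.9987])

submission = validation.copy()
submission["UHI Index"] = predictions

# index=False keeps the row index out of the file.
submission.to_csv("submission.csv", index=False)

# Read the file back to confirm its shape before uploading.
check = pd.read_csv("submission.csv")
print(check)
```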

Supporting Material:

Participants can refer to the following material before starting model development:

- Participants' guidance document, which provides a detailed overview of urban heat island concepts, relevant datasets, and suggestions for model development
- Jupyter notebook where a sample model has been built by using challenge training data
- Sample notebook to download a GeoTIFF image from the Sentinel-2 satellite dataset
- How to Get Started video
- Tips for Success video

This ZIP file contains all of the required content mentioned above. You will find datasets, sample notebooks and documentation to support the data challenge.

Terms of Use and Licensing requirements for the datasets:

Training Data:

Description: Ground temperature data over New York City on July 24, 2021 (CSV format)
Contributors: Climate, Adaptation, Planning, Analytics (CAPA) Strategies
Data Host: Center for Open Science - https://www.cos.io
Terms of Use: https://github.com/CenterForOpenScience/cos.io/blob/master/TERMS_OF_USE.md
License: Apache 2.0 > https://github.com/CenterForOpenScience/cos.io/blob/master/LICENSE

Satellite Data (Sentinel-2 Sample Output):

Description: Copernicus Sentinel-2 sample data from 2021 obtained from the Microsoft Planetary Computer (TIFF format)
Contributors: European Space Agency (ESA), Microsoft
Data Host: Microsoft Planetary Computer - https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a
Terms of Use: https://sentinel.esa.int/documents/247904/690755/Sentinel_Data_Legal_Notice
License: https://creativecommons.org/licenses/by-sa/3.0/igo/

Building Footprint Data:

Description: Building footprint polygons over the data challenge region of interest (KML format)
Contributors: Open Data Team at the NYC Office of Technology and Innovation (OTI) - New York City Open Data Project
Data Host: https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh
Terms of Use: https://www.nyc.gov/html/data/terms.html and https://www.nyc.gov/home/terms-of-use.page
License: https://github.com/CityOfNewYork/nyc-geo-metadata#Apache-2.0-1-ov-file

Weather Data:

Description: Detailed weather data collected every 5 minutes at two locations (Bronx and Manhattan). Includes surface air temperature (at 2 meters), relative humidity, average wind speed, wind direction, and solar flux.
Contributors: New York State Mesonet
Data Host: https://nysmesonet.org/
Terms of Use: https://nysmesonet.org/about/data
License: https://nysmesonet.org/documents/NYS_Mesonet_Data_Access_Policy.pdf

In [19]:
import pandas as pd

# Load UHI index data
uhi_data = pd.read_csv('Training_data_uhi_index.csv')
# Standardize column names (remove spaces)
uhi_data.columns = uhi_data.columns.str.strip()
uhi_data.head()

Unnamed: 0,Longitude,Latitude,datetime,UHI Index
0,-73.909167,40.813107,24-07-2021 15:53,1.030289
1,-73.909187,40.813045,24-07-2021 15:53,1.030289
2,-73.909215,40.812978,24-07-2021 15:53,1.023798
3,-73.909242,40.812908,24-07-2021 15:53,1.023798
4,-73.909257,40.812845,24-07-2021 15:53,1.021634
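
The `datetime` column above is stored as day-first text (`24-07-2021 15:53`). A minimal sketch of parsing it with an explicit format, so that day and month cannot be swapped, is shown below. The two sample strings are illustrative, and `minute_of_day` is a hypothetical derived feature.

```python
import pandas as pd

sample = pd.DataFrame({"datetime": ["24-07-2021 15:53", "24-07-2021 15:01"]})

# An explicit format avoids any ambiguity between day-first and month-first.
sample["datetime"] = pd.to_datetime(sample["datetime"], format="%d-%m-%Y %H:%M")

# Hypothetical time feature: minutes elapsed since midnight.
sample["minute_of_day"] = (
    sample["datetime"].dt.hour * 60 + sample["datetime"].dt.minute
)
print(sample)
```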


In [20]:
import geopandas as gpd

# Load building footprints
building_data = gpd.read_file('Building_Footprint.kml')
building_data.head()

Unnamed: 0,Name,Description,geometry
0,,,"MULTIPOLYGON (((-73.91903 40.8482, -73.91933 4..."
1,,,"MULTIPOLYGON (((-73.92195 40.84963, -73.92191 ..."
2,,,"MULTIPOLYGON (((-73.9205 40.85011, -73.92045 4..."
3,,,"MULTIPOLYGON (((-73.92056 40.8514, -73.92053 4..."
4,,,"MULTIPOLYGON (((-73.91234 40.85218, -73.91247 ..."


In [21]:
# Load weather data
weather_data = pd.read_excel('NY_Mesonet_Weather.xlsx')
weather_data.head()
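
The weather readings are taken every 5 minutes, while each UHI point carries its own timestamp. One way to attach the closest reading to each point is `pd.merge_asof` with `direction="nearest"`. The sketch below uses made-up values, and the weather column names (`obs_time`, `air_temp_c`) are hypothetical; the real workbook's column names may differ.

```python
import pandas as pd

# Made-up UHI points with parsed timestamps.
uhi = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-07-24 15:53", "2021-07-24 15:02", "2021-07-24 15:27"]
    ),
    "UHI Index": [1.030289, 0.98, 1.01],
})

# Made-up weather readings every 5 minutes from 15:00 to 15:55.
weather = pd.DataFrame({
    "obs_time": pd.to_datetime(
        [f"2021-07-24 15:{minute:02d}" for minute in range(0, 60, 5)]
    ),
})
weather["air_temp_c"] = [27.0 + 0.1 * i for i in range(len(weather))]

# merge_asof requires both frames to be sorted by their time keys.
merged = pd.merge_asof(
    uhi.sort_values("datetime"),
    weather.sort_values("obs_time"),
    left_on="datetime",
    right_on="obs_time",
    direction="nearest",
    tolerance=pd.Timedelta("5min"),
)
print(merged)
```

The `tolerance` argument leaves the weather columns empty for any point with no reading within 5 minutes, instead of silently matching a distant one.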

In [22]:
print(uhi_data.columns)


Index(['Longitude', 'Latitude', 'datetime', 'UHI Index'], dtype='object')


In [23]:
uhi_data.rename(columns=lambda x: x.strip(), inplace=True)  # Removes extra spaces
uhi_data.rename(columns={'UHI Index': 'UHI_Index'}, inplace=True)  # Rename if needed

In [24]:
from sklearn.preprocessing import MinMaxScaler

if 'UHI_Index' in uhi_data.columns:
    scaler = MinMaxScaler()
    uhi_data['UHI_Index_Normalized'] = scaler.fit_transform(uhi_data[['UHI_Index']])
else:
    print("Error: 'UHI_Index' column not found in dataset!")
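
With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each value `x` to `(x - min) / (max - min)`. If a model is trained on the normalized column, its predictions have to be mapped back to the original UHI Index scale before they are submitted. The sketch below shows both directions with plain NumPy on made-up values.

```python
import numpy as np

# Made-up UHI Index values.
uhi = np.array([0.96, 1.00, 1.02, 1.04])

lo, hi = uhi.min(), uhi.max()

# Forward transform: what min-max scaling to [0, 1] computes.
normalized = (uhi - lo) / (hi - lo)

# Inverse transform: recover the original scale from normalized values.
restored = normalized * (hi - lo) + lo

print(normalized)
print(restored)
```

With the fitted scaler itself, the same inverse step is available as `scaler.inverse_transform(...)`.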


In [25]:
print(uhi_data.head())


   Longitude   Latitude          datetime  UHI_Index  UHI_Index_Normalized
0 -73.909167  40.813107  24-07-2021 15:53   1.030289              0.824866
1 -73.909187  40.813045  24-07-2021 15:53   1.030289              0.824866
2 -73.909215  40.812978  24-07-2021 15:53   1.023798              0.752674
3 -73.909242  40.812908  24-07-2021 15:53   1.023798              0.752674
4 -73.909257  40.812845  24-07-2021 15:53   1.021634              0.728610


In [27]:
import geopandas as gpd

# Convert UHI data into a GeoDataFrame
uhi_gdf = gpd.GeoDataFrame(uhi_data,
                           geometry=gpd.points_from_xy(uhi_data['Longitude'], uhi_data['Latitude']),
                           crs="EPSG:4326")  # Use WGS84 coordinate system

# Load Building Footprint Data
building_data = gpd.read_file('Building_Footprint.kml')

# Spatial join: attach the footprint polygon (if any) that each UHI point falls on
uhi_gdf = gpd.sjoin(uhi_gdf, building_data, how="left", predicate="intersects")

# For each point, count how many UHI points matched the same footprint polygon.
# Points that fall on no footprint get NaN here. Note that this is not a
# building density: it counts points per building, not buildings near a point.
uhi_gdf['Building_Count'] = uhi_gdf.groupby('index_right')['index_right'].transform('count')

# Replace NaNs with 0 where no buildings were found
uhi_gdf['Building_Count'] = uhi_gdf['Building_Count'].fillna(0)

# Drop unnecessary columns
uhi_gdf.drop(columns=['index_right'], inplace=True)

# Convert back to DataFrame
uhi_data = pd.DataFrame(uhi_gdf)

print(uhi_data.head())


   Longitude   Latitude          datetime  UHI_Index  UHI_Index_Normalized  \
0 -73.909167  40.813107  24-07-2021 15:53   1.030289              0.824866   
1 -73.909187  40.813045  24-07-2021 15:53   1.030289              0.824866   
2 -73.909215  40.812978  24-07-2021 15:53   1.023798              0.752674   
3 -73.909242  40.812908  24-07-2021 15:53   1.023798              0.752674   
4 -73.909257  40.812845  24-07-2021 15:53   1.021634              0.728610   

                     geometry Name Description  Building_Count  
0  POINT (-73.90917 40.81311)  NaN         NaN             0.0  
1  POINT (-73.90919 40.81304)  NaN         NaN             0.0  
2  POINT (-73.90922 40.81298)  NaN         NaN             0.0  
3  POINT (-73.90924 40.81291)  NaN         NaN             0.0  
4  POINT (-73.90926 40.81284)  NaN         NaN             0.0  
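
The point-in-polygon join above yields 0 for the rows shown, since traverse points recorded along streets rarely fall on a footprint. An alternative that may capture building density better is to count the footprints within a fixed radius of each point. The sketch below is one possible approach, not the challenge's method: the geometries are made up, the helper `count_buildings_within` and the 100 m radius are hypothetical, and EPSG:32618 (UTM zone 18N, units in meters) is assumed as a suitable projected CRS for New York City.

```python
import geopandas as gpd
from shapely.geometry import Point, box

def count_buildings_within(points, buildings, radius_m=100.0, metric_crs="EPSG:32618"):
    """Count building polygons intersecting a radius_m buffer around each point."""
    # Buffer distances are only meaningful in a projected CRS with meter units.
    pts = points.to_crs(metric_crs)
    bld = buildings.to_crs(metric_crs)
    buffers = pts.set_geometry(pts.geometry.buffer(radius_m))
    joined = gpd.sjoin(buffers, bld, how="inner", predicate="intersects")
    # One joined row per (point, building) pair; count rows per point.
    counts = joined.groupby(level=0).size()
    return counts.reindex(points.index, fill_value=0)

# Made-up sample points (longitude, latitude).
points = gpd.GeoDataFrame(
    {"point_id": [0, 1]},
    geometry=[Point(-73.9092, 40.8131), Point(-73.9500, 40.7800)],
    crs="EPSG:4326",
)

# Made-up footprints: two close to the first point, one far from both.
buildings = gpd.GeoDataFrame(
    geometry=[
        box(-73.9094, 40.8132, -73.9093, 40.8133),
        box(-73.9091, 40.8129, -73.9090, 40.8130),
        box(-73.9300, 40.8000, -73.9299, 40.8001),
    ],
    crs="EPSG:4326",
)

building_counts = count_buildings_within(points, buildings)
print(building_counts)
```

The radius is a modeling choice; trying several values and comparing their effect on validation scores is one way to pick it.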
