# NASA SPACE CHALLENGE 2025
# Regression Model for Heatwaves (Thermal Stress Days)
This Colab aims to develop a preliminary version of a machine learning model that can illustrate the trends and context of urban warming.

# Team
- Barrantes Gallardo, Diana Michelle
- Diaz Becerra, Dider Anthony
- Valdiviezo Zavaleta, Jesús Arturo

## Objective
The goal of this project is to develop and train a regression model capable of predicting the variable CL_WDS: heat stress days. This variable represents the number of days per year on which the surface air temperature exceeds 32 °C. By providing accurate predictions, the model is intended to serve as a decision-support tool for urban planners, public health agencies, and resource managers, facilitating the development of mitigation and adaptation strategies in the context of climate change and urbanization.






## Importing libraries

In [1]:
import pandas as pd

## Dataset `GHS_UCDB_THEME_CLIMATE_GLOBE_R2024A`


In [2]:
# Access the dataset hosted on GitHub
url = 'https://raw.githubusercontent.com/dm-barr/NasaSpaceApps/refs/heads/main/DataSets/Temperatura/SV15PLTB_PALS_TA_WG_M500_v040_20150818_both.csv'


# Load the dataset
df = pd.read_csv(url)  # read_csv accepts a URL as well as a local path

# Show the first rows
df.head()

Unnamed: 0,Date,Lat,Lon,TAV,TAH,TIR,sec
0,20150818,31.4986,-110.9933,-9999.0,-9999.0,-9999.0,-9999.0
1,20150818,31.4986,-110.9881,-9999.0,-9999.0,-9999.0,-9999.0
2,20150818,31.4986,-110.9829,-9999.0,-9999.0,-9999.0,-9999.0
3,20150818,31.4986,-110.9777,-9999.0,-9999.0,-9999.0,-9999.0
4,20150818,31.4986,-110.9725,-9999.0,-9999.0,-9999.0,-9999.0


# Dataset exploration


## Dataset preparation

In [3]:
# Data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19780 entries, 0 to 19779
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0       Date    19780 non-null  int64  
 1          Lat  19780 non-null  float64
 2          Lon  19780 non-null  float64
 3       TAV     19780 non-null  float64
 4       TAH     19780 non-null  float64
 5       TIR     19780 non-null  float64
 6         sec   19780 non-null  float64
dtypes: float64(6), int64(1)
memory usage: 1.1 MB


In [4]:
# Dimensions: number of rows and columns. Review missing data and data types
print("Dataset dimensions:", df.shape)


# Descriptive statistics
df.describe(include='all')

Dataset dimensions: (19780, 7)


Unnamed: 0,Date,Lat,Lon,TAV,TAH,TIR,sec
count,19780.0,19780.0,19780.0,19780.0,19780.0,19780.0,19780.0
mean,20150818.0,31.693733,-110.39938,-318.063878,-340.269024,-566.981107,48021.50318
std,0.0,0.114046,0.344379,2420.818728,2415.278318,2358.574775,15046.527179
min,20150818.0,31.4986,-110.9933,-9999.0,-9999.0,-9999.0,-9999.0
25%,20150818.0,31.5949,-110.6976,283.6,258.1,18.8,46936.225
50%,20150818.0,31.69365,-110.3994,287.9,264.2,21.0,51552.85
75%,20150818.0,31.7925,-110.1011,290.4,269.2,25.9,55123.8
max,20150818.0,31.8891,-109.8055,295.9,284.8,34.2,58328.8


Date: Date of the measurement (identical for every row in this file: 20150818).

Lat: Geographic latitude of the measurement point.

Lon: Geographic longitude of the measurement point.

TAV: Apparent visible temperature (in Kelvin or a similar unit).

TAH: Apparent horizontal temperature (in Kelvin or a similar unit).

TIR: Infrared temperature. The non-missing values in the statistics above (quartiles of 18.8 to 25.9, maximum 34.2) suggest degrees Celsius rather than Kelvin.

sec: Time in seconds (possibly seconds since midnight or another cumulative time measure).
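
The descriptive statistics mix real measurements with the placeholder -9999.0, which is why the mean of TAV comes out negative even though its quartiles sit between roughly 284 and 290. The following is a minimal sketch of that effect using only the standard library. The helper name `summarize` and the four-value sample are illustrative assumptions, not part of the notebook.

```python
from statistics import mean

SENTINEL = -9999.0  # placeholder that marks missing readings in this file


def summarize(values, sentinel=SENTINEL):
    """Return (missing_fraction, raw_mean, clean_mean) for a list of floats."""
    valid = [v for v in values if v != sentinel]
    missing_fraction = 1 - len(valid) / len(values)
    clean_mean = mean(valid) if valid else float("nan")
    return missing_fraction, mean(values), clean_mean


# Hypothetical sample: one missing reading and three plausible TAV values
sample = [-9999.0, 283.6, 287.9, 290.4]
missing_fraction, raw_mean, clean_mean = summarize(sample)
print(f"missing: {missing_fraction:.0%}")
print(f"raw mean: {raw_mean:.1f}, clean mean: {clean_mean:.1f}")
```

A single placeholder among four values is enough to push the raw mean far below zero, so statistics on these columns are only meaningful once the placeholder has been excluded.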

# Task
Train a K-Nearest Neighbors regression model using 'Lat' and 'Lon' as independent variables and 'TAV' as the target variable. Address negative and null values in these columns through imputation. Split the data, train the model, and evaluate its performance. Describe each step in a text block.

## Data cleaning and preprocessing


In [5]:
import numpy as np

# Strip whitespace from column names
df.columns = df.columns.str.strip()

# Replace -9999.0 with NaN
df[['TAV', 'Lat', 'Lon']] = df[['TAV', 'Lat', 'Lon']].replace(-9999.0, np.nan)

# Impute missing values with the mean
df['TAV'] = df['TAV'].fillna(df['TAV'].mean())
df['Lat'] = df['Lat'].fillna(df['Lat'].mean())
df['Lon'] = df['Lon'].fillna(df['Lon'].mean())


# Verify no missing values
print(df[['TAV', 'Lat', 'Lon']].isnull().sum())

TAV    0
Lat    0
Lon    0
dtype: int64
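
Mean imputation assigns the same TAV value to every row whose measurement was missing, so a model trained afterwards is also fitted and scored on those filled-in targets. One alternative, shown here as a hedged sketch and not as the notebook's method, is to discard rows whose target is missing and reserve imputation for features. The row layout and the helper names `impute_mean` and `drop_missing_target` are assumptions for illustration.

```python
from statistics import mean

SENTINEL = -9999.0  # placeholder that marks missing readings


def impute_mean(values, sentinel=SENTINEL):
    """Replace sentinel entries with the mean of the valid entries."""
    valid = [v for v in values if v != sentinel]
    fill = mean(valid)
    return [fill if v == sentinel else v for v in values]


def drop_missing_target(rows, target="TAV", sentinel=SENTINEL):
    """Keep only rows whose target was actually measured."""
    return [row for row in rows if row[target] != sentinel]


# Hypothetical rows shaped like the columns used in this notebook
rows = [
    {"Lat": 31.50, "Lon": -110.99, "TAV": -9999.0},
    {"Lat": 31.60, "Lon": -110.70, "TAV": 284.0},
    {"Lat": 31.70, "Lon": -110.40, "TAV": 288.0},
]

imputed = impute_mean([row["TAV"] for row in rows])
kept = drop_missing_target(rows)
print(imputed)    # the first entry becomes the mean of 284.0 and 288.0
print(len(kept))  # rows remaining after dropping the missing target
```

Which option is preferable depends on how many targets are missing; comparing the evaluation metrics under both choices is a reasonable check.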



## Feature and target selection

Define the independent variables (features) as 'Lat' and 'Lon', and the target variable as 'TAV'.


In [7]:
features = ['Lat', 'Lon']
target = 'TAV'

## Data splitting
Split the dataset into training (80%) and testing (20%) sets, using a fixed random seed so the split is reproducible.


In [8]:
from sklearn.model_selection import train_test_split

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (15824, 2)
Shape of X_test: (3956, 2)
Shape of y_train: (15824,)
Shape of y_test: (3956,)
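
The shapes above follow directly from holding out 20% of the 19780 rows. As a rough illustration of what a seeded shuffle-and-cut split does, here is a standard-library sketch. It does not reproduce the exact rows that `train_test_split` selects, and the helper names `split_sizes` and `shuffled_split` are hypothetical.

```python
import random


def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test) for a given fraction held out for testing."""
    n_test = round(n_rows * test_fraction)
    return n_rows - n_test, n_test


def shuffled_split(items, test_fraction, seed):
    """Shuffle a copy of items reproducibly, then cut off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), test_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Sizes for the dataset used in this notebook: (15824, 3956)
print(split_sizes(19780, 0.2))

# Toy split of ten row indices with a fixed seed
train, test = shuffled_split(range(10), test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Fixing the seed means that rerunning the cell yields the same partition, which keeps later metrics comparable between runs.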


## Model training
Train a K-Nearest Neighbors regression model on the training data.


In [9]:
from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor(n_neighbors=5)

knn_model.fit(X_train, y_train)
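
With its default settings, `KNeighborsRegressor` predicts a value by averaging the targets of the `n_neighbors` closest training points, measured by Euclidean distance and weighted equally. The sketch below spells out that rule with the standard library on made-up points. The function name `knn_predict` and the data are assumptions for illustration, and the brute-force search here ignores the indexing structures the library may use.

```python
import math


def knn_predict(train_points, train_targets, query, k=5):
    """Predict by averaging the targets of the k nearest training points."""
    distances = [
        (math.dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    ]
    distances.sort(key=lambda pair: pair[0])  # nearest first
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical two-coordinate points with made-up target values
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
targets = [10.0, 12.0, 14.0, 100.0]

# The three nearest points to the query have targets 10, 12 and 14
prediction = knn_predict(points, targets, query=(0.1, 0.1), k=3)
print(prediction)
```

Note that treating Lat and Lon as plain Euclidean coordinates measures distance in degrees, which is an approximation that is usually acceptable over a small area such as the one covered by this file.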

## Model evaluation
Evaluate the trained model using appropriate metrics on the testing data.


In [10]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the testing data
y_pred = knn_model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculate R-squared score
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2) Score: {r2}")

Mean Squared Error (MSE): 0.24311604726995426
R-squared (R2) Score: 0.9853252049756697
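
For reference, MSE is the average squared difference between predictions and true values, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes both by hand on a made-up example. The function names `mse_by_hand` and `r2_by_hand` and the sample values are illustrative assumptions. Taking the square root of the MSE reported above gives an RMSE of roughly 0.49 in the units of TAV.

```python
import math


def mse_by_hand(y_true, y_pred):
    """Mean of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values, only to show the arithmetic
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)
print(mse, r2)

# RMSE corresponding to the MSE printed by the notebook
print(math.sqrt(0.24311604726995426))
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the target variable.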


In [11]:
import joblib
import os

# Define the path to save the model in your Google Drive
model_save_path = '/content/drive/My Drive/NASA/knn_regression_model.pkl'

# Ensure the directory exists
os.makedirs(os.path.dirname(model_save_path), exist_ok=True)

# Save the trained model
joblib.dump(knn_model, model_save_path)

print(f"Model saved successfully to: {model_save_path}")

Model saved successfully to: /content/drive/My Drive/NASA/knn_regression_model.pkl


In [12]:
!pip install Flask joblib

