# NASA Space Apps Challenge 2025
# Regression model for population density per square kilometer
This Colab aims to develop a preliminary version of a machine learning model that can illustrate the trends and context of urban warming.

# Work team:
- Barrantes Gallardo, Diana Michelle
- Diaz Becerra, Dider Anthony
- Valdiviezo Zavaleta, Jesús Arturo


# Library Imports

This code block sets up the environment for the project. It installs and imports the libraries needed for data analysis, visualization, and model building for a regression task.


In [None]:
!pip install seaborn scikit-learn matplotlib

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
# Regression metrics used later in the notebook
from sklearn.metrics import mean_squared_error, r2_score



# Step 1: Load the dataset

This code block handles loading a data file directly from a remote URL into a Pandas DataFrame. This process is a fundamental first step in a data science project, as it makes the data available for all subsequent analysis and manipulation tasks.


In [None]:
import pandas as pd

# Access the existing dataset on GitHub
url = 'https://raw.githubusercontent.com/dm-barr/NasaSpaceApps/refs/heads/main/DataSets/Poblacion/gpw_v4_admin_unit_center_points_population_estimates_rev11_nam.csv'


# Load the dataset
df = pd.read_csv(url)  # read_csv accepts a URL as well as a local path

# Show the first rows
df.head()

Unnamed: 0,GUBID,ISOALPHA,COUNTRYNM,NAME1,NAME2,NAME3,NAME4,NAME5,NAME6,CENTROID_X,...,A60_64M,A65PLUSM,A65_69M,A70PLUSM,A70_74M,A75PLUSM,A75_79M,A80PLUSM,A80_84M,A85PLUSM
0,{79D49630-1D1B-480E-B2B6-57D2CB5E924A},NAM,Namibia,Hardap,Rehoboth Urban West,30698002,,,,17.032845,...,1.510019,2.911404,1.129798,1.781605,0.890803,0.890803,0.488855,0.401948,0.217269,0.184679
1,{482878E4-E470-404E-823C-0BD2DC87B57C},NAM,Namibia,Khomas,Katutura East,60401086,,,,17.040778,...,2.783886,5.150188,1.832725,3.317464,1.322346,1.995118,1.067156,0.927962,0.347986,0.579976
2,{BF5EA3B2-69C4-431D-A8C5-930E12F29CB2},NAM,Namibia,Kunene,Kamanajb,70299007,,,,15.71841,...,5.106322,8.125712,3.507821,4.617891,2.22014,2.397751,1.243278,1.154473,0.488431,0.666042
3,{672349C9-002C-406F-9B00-AD7B8DD18256},NAM,Namibia,Omusati,Ogongo,100399017,,,,15.440616,...,6.437725,19.956947,4.966245,14.990702,3.862635,11.128067,3.617388,7.510679,3.096239,4.41444
4,{10042BEE-2F5B-4C89-B037-C7DFA78FF5AD},NAM,Namibia,Khomas,Windhoek West,60901073,,,,17.065756,...,3.257268,4.722691,2.125211,2.59748,1.31263,1.28485,0.604227,0.680623,0.326421,0.354202
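Because the file has many columns and only three are used later, it can help to load just those columns and check them for missing values before modeling. The sketch below illustrates the idea on a tiny in-memory CSV. The rows and values are hypothetical and are not taken from the GPW file; only the column names come from this notebook.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the real file (values are made up).
csv_text = """INSIDE_X,INSIDE_Y,UN_2020_DS,NAME1
17.03,-23.31,12.5,Hardap
17.04,-22.55,,Khomas
15.72,-19.63,3.1,Kunene
"""

needed = ['INSIDE_X', 'INSIDE_Y', 'UN_2020_DS']

# usecols restricts parsing to the columns the model needs.
demo = pd.read_csv(io.StringIO(csv_text), usecols=needed)

# Count missing values per column, then drop incomplete rows.
missing_per_column = demo.isna().sum()
clean = demo.dropna(subset=needed)

print(missing_per_column)
print(clean)
```

With the real dataset, the same `usecols` argument can be passed to `pd.read_csv(url, ...)`. Whether any rows are actually missing there is not something this sketch assumes.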


# Task
Implement a KNN model using the columns 'INSIDE_X' and 'INSIDE_Y' as predictor variables and 'UN_2020_DS' as the target variable. Split the data into training and test sets, train the model, evaluate it, and make predictions with dummy data.

## Data Preparation

### Subtask:
Select the predictor variables ('INSIDE_X', 'INSIDE_Y') and the target variable ('UN_2020_DS'), and split the data into training and test sets.

In [None]:
X = df[['INSIDE_X', 'INSIDE_Y']]
y = df['UN_2020_DS']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

display(X_train.head())
display(y_train.head())

Unnamed: 0,INSIDE_X,INSIDE_Y
5238,18.368689,-18.152039
1034,15.985457,-17.968062
3515,17.076818,-22.971768
931,16.757094,-17.641154
5000,17.365388,-21.07053


Unnamed: 0,UN_2020_DS
5238,1.806926
1034,44.946864
3515,421.745011
931,6.311011
5000,2855.423614
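As a small illustration of what the split does, the sketch below applies `train_test_split` to a hypothetical 10-row frame that reuses the notebook's column names. It shows that `test_size=0.2` holds out 20% of the rows, that features and targets stay aligned by index, and that a fixed `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; the values are placeholders, not GPW data.
demo = pd.DataFrame({
    'INSIDE_X': range(10),
    'INSIDE_Y': range(10, 20),
    'UN_2020_DS': range(100, 110),
})
X_demo = demo[['INSIDE_X', 'INSIDE_Y']]
y_demo = demo['UN_2020_DS']

# First split: 80% train, 20% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Second split with the same seed selects the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(len(X_tr), len(X_te))
print(X_te.index.equals(X_te2.index))
```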


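## KNN Model Training

### Subtask:
Train a KNN model for regression with the training data.

### Explanation of the Process
With the data prepared, the next step is to fit the model. K-Nearest Neighbors (KNN) is an instance-based method: calling `fit` stores the training points rather than estimating coefficients, and, with the default settings, each prediction is the average of the target values of the `n_neighbors` closest training points.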
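To make the mechanics concrete, here is a minimal sketch that computes a KNN regression prediction by hand and compares it with scikit-learn. It assumes the default settings (Euclidean distance, uniform weights) and uses synthetic data, not the GPW dataset. The names `X_demo`, `y_demo`, and `query` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for coordinates (X_demo) and density (y_demo).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = rng.uniform(0, 100, size=50)
query = np.array([[5.0, 5.0]])

# Manual prediction: find the k closest training points
# and average their target values.
k = 5
distances = np.linalg.norm(X_demo - query, axis=1)
nearest = np.argsort(distances)[:k]
manual_prediction = y_demo[nearest].mean()

# KNeighborsRegressor with default uniform weights should agree.
model = KNeighborsRegressor(n_neighbors=k).fit(X_demo, y_demo)
sklearn_prediction = model.predict(query)[0]

print(manual_prediction, sklearn_prediction)
```

One consequence of averaging is that a prediction always lies between the smallest and largest target values among the selected neighbors.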

In [None]:
from sklearn.neighbors import KNeighborsRegressor

knn_model = KNeighborsRegressor(n_neighbors=5)

knn_model.fit(X_train, y_train)

## KNN Model Evaluation

### Subtask:
Evaluate the performance of the KNN model on the test set using standard regression metrics (`MSE`, `RMSE`, `R-squared`).


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred_knn = knn_model.predict(X_test)

mse_knn = mean_squared_error(y_test, y_pred_knn)
rmse_knn = np.sqrt(mse_knn)
r2_knn = r2_score(y_test, y_pred_knn)

print(f'KNN Mean Squared Error (MSE): {mse_knn}')
print(f'KNN Root Mean Squared Error (RMSE): {rmse_knn}')
print(f'KNN R-squared (R2): {r2_knn}')

KNN Mean Squared Error (MSE): 20418275.471556213
KNN Root Mean Squared Error (RMSE): 4518.6585920554135
KNN R-squared (R2): 0.7014441794733726
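For reference, the three metrics can be computed directly from the residuals. The sketch below does this on a small hypothetical pair of arrays (`y_true`, `y_hat`) and checks the results against the scikit-learn functions used in the cell above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observed and predicted values.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

residuals = y_true - y_hat

# MSE: mean of squared residuals; RMSE: its square root,
# expressed in the same units as the target.
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# R-squared: 1 minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
print(mean_squared_error(y_true, y_hat), r2_score(y_true, y_hat))
```

Because RMSE shares the target's units, it is usually the easiest of the three to compare against typical density values.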


## Predictions with Dummy Data

### Subtask:
Use the trained KNN model to make predictions for a few dummy coordinates and apply post-processing to ensure the predictions aren't negative.


In [None]:
dummy_data = pd.DataFrame({
    'INSIDE_X': [-78.5, 20.0, 15.0],
    'INSIDE_Y': [-7.1667, -18.5, -20.0]
})

predictions_knn = knn_model.predict(dummy_data)

predictions_knn[predictions_knn < 0] = 0

print("KNN Predictions for dummy data:")
print(predictions_knn)

KNN Predictions for dummy data:
[  1.17246063  11.78500671 435.57061379]
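KNN always returns an answer, even for a query far from every training point, so it can be worth checking whether a query falls inside the range of coordinates the model has seen. For instance, the first dummy point has a longitude of -78.5, while the training rows displayed earlier have longitudes of roughly 16 to 18. The sketch below shows one simple bounding-box check. The `train_demo` coordinates are hypothetical placeholders; in the notebook, `X_train` would take their place.

```python
import pandas as pd

# Hypothetical training coordinates (placeholders for X_train).
train_demo = pd.DataFrame({
    'INSIDE_X': [15.0, 17.0, 19.5, 21.0],
    'INSIDE_Y': [-22.0, -18.0, -25.0, -19.5],
})

# Same dummy queries as in the cell above.
queries = pd.DataFrame({
    'INSIDE_X': [-78.5, 20.0, 15.0],
    'INSIDE_Y': [-7.1667, -18.5, -20.0],
})

# Per-column minimum and maximum of the training coordinates.
lower = train_demo.min()
upper = train_demo.max()

# A query is "inside" if every coordinate lies within [lower, upper].
inside_box = (queries.ge(lower, axis=1) & queries.le(upper, axis=1)).all(axis=1)

print(inside_box.tolist())
```

A bounding box is a coarse test. An alternative is to inspect the neighbor distances returned by the fitted model's `kneighbors` method.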


## Saving the Trained KNN Model

### Subtask:
Save the trained KNN model to a .pkl file.

In [None]:
import pickle

# Full path of the output file in Google Drive
# (Drive must already be mounted at /content/drive).
# Make sure the path matches your folder structure.
filename = '/content/drive/My Drive/Colab Notebooks/NasaSpace/knn_model.pkl'

# Save the model to the file
with open(filename, 'wb') as file:
    pickle.dump(knn_model, file)

print(f"The KNN model has been saved to '{filename}'")

The KNN model has been saved to '/content/drive/My Drive/Colab Notebooks/NasaSpace/knn_model.pkl'
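A saved model is only useful if it can be restored, so a quick round-trip check is a reasonable follow-up. The sketch below trains a small model on synthetic data, writes it to a temporary directory instead of Google Drive, reloads it with `pickle.load`, and confirms that both copies give the same predictions. The data and the temporary path are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data (not the GPW dataset).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = rng.uniform(0, 100, size=30)
model = KNeighborsRegressor(n_neighbors=5).fit(X_demo, y_demo)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'knn_model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is safest to reload with the same scikit-learn version that produced it.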


## Summary:

### Data Analysis Key Findings

*   The KNN regression model achieved a Mean Squared Error (MSE) of approximately 20,418,275.47, a Root Mean Squared Error (RMSE) of approximately 4,518.66, and an R-squared ($R^2$) value of approximately 0.7014 on the test set.
*   The model was successfully used to make predictions on fictitious data.
*   A post-processing step was applied to ensure that all predictions were non-negative, setting any negative predictions to 0.

### Insights or Next Steps

*   The $R^2$ value of 0.7014 indicates that the KNN model explains about 70% of the variance in the target variable on the test data, which is a reasonable performance but could potentially be improved.
*   Further steps could involve hyperparameter tuning of the KNN model (e.g., experimenting with different values of `n_neighbors`) or exploring other regression algorithms to potentially improve model performance.
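One way to approach the hyperparameter tuning mentioned above is a cross-validated grid search. The sketch below uses `GridSearchCV` on synthetic data. The candidate values for `n_neighbors` and `weights` are arbitrary starting points and not recommendations for this dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data with a smooth signal plus noise (not the GPW dataset).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = (np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1]
          + rng.normal(0, 0.1, size=200))

# Candidate hyperparameters (illustrative choices).
param_grid = {
    'n_neighbors': [1, 3, 5, 10, 20],
    'weights': ['uniform', 'distance'],
}

# 5-fold cross-validation, scored by R-squared.
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

In the notebook, `X_train` and `y_train` would replace the synthetic arrays, keeping the test set untouched for the final evaluation.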
