<a href="https://colab.research.google.com/github/floranuta/Data_Circle/blob/main/notebooks/Task13_GeospatialAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Task 1.3: Geospatial Analysis
- **Ticket 1.3.1**: Create maps of water pump locations colored by functionality status
  - Use geopandas/folium to visualize pump locations on Tanzania map
  - Analyze geographic clusters of functional/non-functional pumps
  
- **Ticket 1.3.2**: Analyze regional patterns in water pump functionality
  - Create visualizations showing functionality rates by region/district
  - Identify areas with unusually high failure rates
  
- **Ticket 1.3.3**: Investigate relationships between geography and other features
  - Analyze how water source types vary by region
  - Explore relationships between elevation (gps_height) and functionality
  
- **Ticket 1.3.4**: Create geospatial features
  - Calculate distances to the nearest city/population center, if such data is available
  - Generate region-level aggregated statistics
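
For Ticket 1.3.4, one way to approximate the distance to the nearest population center is the haversine great-circle formula. The sketch below is illustrative, not the notebook's actual method: the `cities` table, its approximate coordinates, and the helper names `haversine_km` and `add_nearest_city` are assumptions introduced here. Only the `latitude` and `longitude` column names come from the dataset.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# Hypothetical reference table with approximate city coordinates (assumption).
cities = pd.DataFrame({
    "city": ["Dar es Salaam", "Dodoma"],
    "latitude": [-6.79, -6.16],
    "longitude": [39.21, 35.75],
})

def add_nearest_city(pumps, cities):
    """Return a copy of pumps with nearest city name and distance (km)."""
    # Distance matrix of shape (n_pumps, n_cities) via broadcasting.
    dist = haversine_km(
        pumps["latitude"].to_numpy()[:, None],
        pumps["longitude"].to_numpy()[:, None],
        cities["latitude"].to_numpy()[None, :],
        cities["longitude"].to_numpy()[None, :],
    )
    out = pumps.copy()
    out["nearest_city"] = cities["city"].to_numpy()[dist.argmin(axis=1)]
    out["dist_nearest_city_km"] = dist.min(axis=1)
    return out

# Toy pumps placed exactly at the two reference cities.
demo_pumps = pd.DataFrame({
    "id": [1, 2],
    "latitude": [-6.79, -6.16],
    "longitude": [39.21, 35.75],
})
demo_with_city = add_nearest_city(demo_pumps, cities)
```

On the real data this would be called as `add_nearest_city(pump_data, cities)`. Rows with missing or placeholder coordinates should be excluded or masked first, since their distances would be meaningless.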


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd

In [2]:
from google.colab import drive
drive.mount('/content/drive')
# Path of the file to read
#csv_file_path = "D:/REDI/Data_Circle/data/training_set_values.csv"
csv_file_path = "/content/drive/MyDrive/Colab Notebooks/training_set_values.csv"
# Read the training-set feature values into pump_data
pump_data = pd.read_csv(csv_file_path)

Mounted at /content/drive


In [3]:
# Read the training-set labels, keyed by the same id column as pump_data
pump_data_labels = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/training_set_labels.csv")

In [4]:
mask_nan = pump_data["latitude"].isna() | pump_data["longitude"].isna()
nan_rows = pump_data[mask_nan]

print("Rows with NaN latitude or longitude:", len(nan_rows))
display(nan_rows.head())

Rows with NaN latitude or longitude: 0


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group


In [5]:
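# Check how many rows have latitude == 0 or longitude == 0.
# Note: in the rows shown below the placeholder latitude prints as -2e-08, not exactly 0,
# so those rows are matched by the longitude == 0 test.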
mask_zero = (pump_data["latitude"] == 0) | (pump_data["longitude"] == 0)
zero_rows = pump_data[mask_zero]

print("Rows with 0 latitude or longitude:", len(zero_rows))
display(zero_rows.head())

Rows with 0 latitude or longitude: 1812


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
21,6091,0.0,10/02/2013,Dwsp,0,DWE,0.0,-2e-08,Muungano,0,...,unknown,unknown,unknown,unknown,unknown,shallow well,shallow well,groundwater,hand pump,hand pump
53,32376,0.0,01/08/2011,Government Of Tanzania,0,Government,0.0,-2e-08,Polisi,0,...,unknown,unknown,unknown,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
168,72678,0.0,30/01/2013,Wvt,0,WVT,0.0,-2e-08,Wvt Tanzania,0,...,other,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
177,56725,0.0,17/01/2013,Netherlands,0,DWE,0.0,-2e-08,Kikundi Cha Wakina Mama,0,...,unknown,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other
253,13042,0.0,29/10/2012,Hesawa,0,DWE,0.0,-2e-08,Kwakisusi,0,...,never pay,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump
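
The rows above carry placeholder coordinates (longitude 0.0, latitude -2e-08) rather than real positions, so they would plot far outside Tanzania. A minimal sketch of one way to handle them is to treat near-zero coordinates as missing. The tolerance of 1e-6 degrees and the helper name `mask_placeholder_coords` are assumptions, not part of the original notebook, and the second row of the demo frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def mask_placeholder_coords(df, tol=1e-6):
    """Return a copy of df with near-zero latitude/longitude pairs set to NaN."""
    out = df.copy()
    # np.isclose catches both exact zeros and values such as -2e-08.
    near_zero = (
        np.isclose(out["latitude"], 0.0, atol=tol)
        | np.isclose(out["longitude"], 0.0, atol=tol)
    )
    out.loc[near_zero, ["latitude", "longitude"]] = np.nan
    return out

# First row mirrors a placeholder row from the output above; second row is hypothetical.
demo_coords = pd.DataFrame({
    "id": [6091, 1],
    "latitude": [-2e-08, -6.16],
    "longitude": [0.0, 35.75],
})
cleaned_coords = mask_placeholder_coords(demo_coords)
```

Masking rather than dropping keeps the rows available for non-spatial analysis, while map plots and distance features can skip them with `dropna(subset=["latitude", "longitude"])`.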


In [6]:
# IDs in pump_data but not in pump_data_labels
missing_ids = pump_data.loc[~pump_data["id"].isin(pump_data_labels["id"]), "id"]

print("Number of pump_data IDs without label:", len(missing_ids))
print(missing_ids.head())

Number of pump_data IDs without label: 0
Series([], Name: id, dtype: int64)
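
Since every `id` in `pump_data` has a matching label, the two tables can be joined on `id` to compute functionality rates per region (Ticket 1.3.2). The sketch below assumes the labels file has a `status_group` column and the values file a `region` column. Neither is visible in the truncated output above, so treat both names, and the helper `functionality_rates_by_region`, as assumptions. It runs on toy data, not on the real files.

```python
import pandas as pd

def functionality_rates_by_region(values, labels,
                                  region_col="region",
                                  status_col="status_group"):
    """Share of each status per region; each row sums to 1."""
    # validate raises if an id appears more than once on either side.
    merged = values.merge(labels, on="id", how="inner", validate="one_to_one")
    return pd.crosstab(merged[region_col], merged[status_col], normalize="index")

# Toy stand-ins for pump_data and pump_data_labels (hypothetical values).
toy_values = pd.DataFrame({"id": [1, 2, 3, 4], "region": ["A", "A", "B", "B"]})
toy_labels = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "status_group": ["functional", "non functional", "functional", "functional"],
})
toy_rates = functionality_rates_by_region(toy_values, toy_labels)
```

With the real frames, `functionality_rates_by_region(pump_data, pump_data_labels)` would return one row per region, which can be sorted by the non-functional share to flag regions with unusually high failure rates.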
