Temporary notebook to run and manually test the Ookla feature generation function.

PHL cluster-level data is taken from [GDrive](https://drive.google.com/drive/u/0/folders/1N9vSFX05bDWGolfsRjTTBUxGtuVP-77s).

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
import os
import sys
sys.path.append("../../../")

import uuid

import numpy as np

import geopandas as gpd
import pandas as pd

from povertymapping import settings, ookla
from geowrangler.datasets.ookla import list_ookla_files




Load temporary data for the PHL clusters.

In [3]:
def load_cluster_gdf(path):
    df = pd.read_csv(path)

    # Some coordinates in the data are invalid. Keep only rows with positive
    # longitude and latitude, which covers all of the Philippines.
    df = df[(df.longitude > 0) & (df.latitude > 0)]

    # Create a GeoDataFrame from the longitude, latitude columns.
    gdf = gpd.GeoDataFrame(
        df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs="epsg:4326"
    )

    print(f"There are {len(gdf):,} clusters.")

    return gdf

In [60]:
GROUND_TRUTH_CSV = settings.DATA_DIR/"phl_dhs_cluster_level.csv"
gdf = load_cluster_gdf(GROUND_TRUTH_CSV)
# Buffer each cluster point by 0.01 degrees (the CRS is EPSG:4326) to get a polygon AOI.
gdf['geometry'] = gdf.geometry.buffer(0.01)

There are 1,213 clusters.
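
Because the GeoDataFrame is in EPSG:4326, the `0.01` passed to `buffer` is in degrees, not meters. As a rough guide to what that radius means on the ground, the sketch below converts a degree offset to kilometers with a spherical-Earth approximation. The helper name `degrees_to_km` and the mean Earth radius constant are assumptions for illustration and are not part of `povertymapping`.

```python
import math

# Mean Earth radius in km (spherical approximation; an assumption for this sketch).
EARTH_RADIUS_KM = 6371.0088


def degrees_to_km(offset_deg, latitude_deg):
    """Return (north_south_km, east_west_km) spanned by offset_deg at a latitude."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = offset_deg * km_per_degree
    # Meridians converge toward the poles, so east-west distance shrinks by cos(lat).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Latitudes in this notebook's outputs range from about 6 to 20 degrees north.
for lat in (6.0, 20.0):
    ns, ew = degrees_to_km(0.01, lat)
    print(f"lat {lat:>4}: 0.01 deg is about {ns:.2f} km N-S, {ew:.2f} km E-W")
```

Under this approximation the buffer radius is a little over 1 km. If a fixed metric radius is needed, reprojecting to a projected CRS before buffering is the usual approach.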




In [62]:
gdf.explore()

Try the Ookla feature generation function.

In [None]:
# %%time
# gdf_with_features = osm.add_osm_features(gdf, "philippines", cache_dir=settings.CACHE_DIR)
# gdf_with_features.head()

Sanity check that the internal helper function can download and cache Ookla data for each type (fixed, mobile) and year.

In [36]:
years = range(2019, 2023)

for year in years:
    ookla.download_ookla_year_data(type_="fixed", year=year, cache_dir=settings.CACHE_DIR)

2023-01-10 20:27:21.257 | INFO     | povertymapping.ookla:download_ookla_year_data:85 - Ookla Data: Number of available files for fixed and 2019: 4
2023-01-10 20:27:21.259 | INFO     | povertymapping.ookla:download_ookla_year_data:99 - Ookla Data: Cached data available for fixed and 2019 at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2019? True
2023-01-10 20:27:21.260 | INFO     | povertymapping.ookla:download_ookla_year_data:85 - Ookla Data: Number of available files for fixed and 2020: 4
2023-01-10 20:27:21.262 | INFO     | povertymapping.ookla:download_ookla_year_data:99 - Ookla Data: Cached data available for fixed and 2020 at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2020? True
2023-01-10 20:27:21.263 | INFO     | povertymapping.ookla:download_ookla_year_data:85 - Ookla Data: Number of available files for fixed and 2021: 4
2023-01-10 20:27:21.265 | INFO     | povertymapping.ookla:download_ookla_year_data

In [38]:
# Check that unavailable years will not be downloaded
ookla.download_ookla_year_data(type_="fixed", year=2017, cache_dir=settings.CACHE_DIR)

2023-01-10 20:29:37.553 | INFO     | povertymapping.ookla:download_ookla_year_data:103 - Ookla Data: Cached data available for fixed and 2017 at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2017? True


'/home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2017'
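
One way to guard against requesting a year with no published data is to filter the list of available file names before downloading anything. The sketch below is a minimal illustration of that idea. `files_for_year` is a hypothetical helper and the `listing` is made up to follow the naming pattern seen in this notebook's logs; it is not how `download_ookla_year_data` is actually implemented.

```python
def files_for_year(available_files, type_, year):
    """Return the file names matching a type and year; empty if none are published."""
    prefix = f"{year}-"
    suffix = f"_performance_{type_}_tiles.parquet"
    return sorted(
        name for name in available_files
        if name.startswith(prefix) and name.endswith(suffix)
    )


# Hypothetical listing, following the naming pattern seen in the logs.
listing = [
    "2022-01-01_performance_fixed_tiles.parquet",
    "2022-04-01_performance_fixed_tiles.parquet",
    "2022-01-01_performance_mobile_tiles.parquet",
]

for year in (2017, 2022):
    matches = files_for_year(listing, "fixed", year)
    if not matches:
        print(f"No fixed files for {year}; skipping download.")
    else:
        print(f"{len(matches)} fixed file(s) for {year}.")
```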

In [49]:
# Inspect an example Ookla file: take the first file listed in the cached 2022 directory.
sample_file_dir = ookla.download_ookla_year_data(type_="fixed", year=2022, cache_dir=settings.CACHE_DIR)
sample_file = os.path.join(sample_file_dir, os.listdir(sample_file_dir)[0])
sample_file
# pd.read_parquet(sample_file)

2023-01-10 20:54:07.809 | INFO     | povertymapping.ookla:download_ookla_year_data:92 - Ookla Data: Number of available files for fixed and 2022: 4
2023-01-10 20:54:07.810 | INFO     | povertymapping.ookla:download_ookla_year_data:106 - Ookla Data: Cached data available for fixed and 2022 at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2022? True


'/home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2022/2022-04-01_performance_fixed_tiles.parquet'

In [64]:
ookla.get_OoklaFile('2022-04-01_performance_fixed_tiles.parquet')

OoklaQuarter(type='fixed', year='2022', quarter='2')
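
The mapping from file name to type, year, and quarter can be reproduced with a small regular expression. The sketch below is an independent illustration, not the code behind `ookla.get_OoklaFile`: `ParsedOoklaFile`, `FILENAME_PATTERN`, and `parse_ookla_filename` are hypothetical names, and the pattern is inferred only from the file names visible in this notebook.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the tuple printed above; not the library's class.
ParsedOoklaFile = namedtuple("ParsedOoklaFile", ["type", "year", "quarter"])

# Pattern inferred from the file names visible in this notebook's logs.
FILENAME_PATTERN = re.compile(
    r"^(?P<year>\d{4})-(?P<month>\d{2})-01_performance_(?P<type>fixed|mobile)_tiles\.parquet$"
)


def parse_ookla_filename(filename):
    """Parse type, year, and quarter from an Ookla tile file name, or return None."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    month = int(match.group("month"))
    # Quarterly files start on the first day of months 1, 4, 7, and 10.
    if month not in (1, 4, 7, 10):
        return None
    quarter = (month - 1) // 3 + 1
    return ParsedOoklaFile(match.group("type"), match.group("year"), str(quarter))


print(parse_ookla_filename("2022-04-01_performance_fixed_tiles.parquet"))
# -> ParsedOoklaFile(type='fixed', year='2022', quarter='2')
```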

In [45]:
%%time

test_data_manager = ookla.OoklaDataManager(cache_dir=settings.CACHE_DIR)
test_df = test_data_manager.load_type_year_data(gdf, 'fixed', 2022, return_geometry=False)

2023-01-11 16:10:25.324 | DEBUG    | povertymapping.ookla:load_type_year_data:62 - Contents of data cache: []
2023-01-11 16:10:25.327 | INFO     | povertymapping.ookla:download_ookla_year_data:140 - Ookla Data: Number of available files for fixed and 2022: 4
2023-01-11 16:10:25.334 | INFO     | povertymapping.ookla:download_ookla_year_data:154 - Ookla Data: Cached data available for fixed and 2022 at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2022? True
2023-01-11 16:10:26.980 | DEBUG    | povertymapping.ookla:load_type_year_data:93 - Ookla data for aoi, fixed 2022 1 being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2022/2022-01-01_performance_fixed_tiles.parquet
2023-01-11 16:10:45.163 | DEBUG    | povertymapping.ookla:load_type_year_data:93 - Ookla data for aoi, fixed 2022 2 being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/data/data_cache/ookla/fixed/2022/2022-04-01_perform

CPU times: user 36.7 s, sys: 30.3 s, total: 1min 6s
Wall time: 56 s


In [46]:
test_df = test_data_manager.load_type_year_data(gdf, 'fixed', 2022, use_cache=False)

2023-01-11 16:11:21.816 | DEBUG    | povertymapping.ookla:load_type_year_data:62 - Contents of data cache: ['3cf3e92f002ba03463994a5baac16f3c']
2023-01-11 16:11:21.817 | DEBUG    | povertymapping.ookla:load_type_year_data:64 - Ookla data for aoi, fixed 2022 (key: 3cf3e92f002ba03463994a5baac16f3c) found in cache.
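
The cache keys in these logs are 32-character hexadecimal strings, which is the shape of an MD5 digest. The sketch below shows one way a key of that shape could be built from the AOI bounds, data type, and year. It is an assumption for illustration only: `make_cache_key` is a hypothetical helper, the bounding box is made up, and the inputs actually hashed by `povertymapping.ookla` may differ.

```python
import hashlib


def make_cache_key(aoi_bounds, type_, year):
    """Build a deterministic 32-character hex key from AOI bounds, type, and year."""
    # Round the bounds so tiny floating-point differences do not change the key.
    rounded = ",".join(f"{value:.6f}" for value in aoi_bounds)
    raw = f"{rounded}|{type_}|{year}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat) for illustration.
bounds = (116.9, 4.5, 126.6, 21.1)
print(make_cache_key(bounds, "fixed", 2022))
print(make_cache_key(bounds, "mobile", 2022))
```

Keying on the AOI as well as type and year means a different set of clusters would not silently reuse a cached result computed for another area.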


In [47]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2195 entries, 0 to 2194
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   quadkey     2195 non-null   object
 1   tile        2195 non-null   object
 2   avg_d_kbps  2195 non-null   int64 
 3   avg_u_kbps  2195 non-null   int64 
 4   avg_lat_ms  2195 non-null   int64 
 5   tests       2195 non-null   int64 
 6   devices     2195 non-null   int64 
 7   quarter     2195 non-null   int64 
dtypes: int64(6), object(2)
memory usage: 137.3+ KB


In [50]:
test_df.head()

,quadkey,tile,avg_d_kbps,avg_u_kbps,avg_lat_ms,tests,devices,quarter
0,1323011210311122,"POLYGON((121.97021484375 20.4527493625349, 121...",23811,4926,706,17,6,1
1,1323012132130333,"POLYGON((120.536499023438 18.1928251977332, 12...",69659,45538,33,17,5,1
2,1323012132131302,"POLYGON((120.56396484375 18.2032620195444, 120...",10289,9458,27,3,3,1
3,1323012132313233,"POLYGON((120.558471679688 18.0675346872031, 12...",63843,53523,20,39,22,1
4,1323012133023330,"POLYGON((120.662841796875 18.1562914028354, 12...",5383,777,122,3,2,1


In [51]:
test_df['quadkey'].value_counts()

1323011210311122    4
1323211103310020    4
1323211103302133    4
1323211103301322    4
1323211103301313    4
                   ..
1323013223300202    1
1323012313202232    1
1323012133232030    1
1323012313021113    1
1323300212332220    1
Name: quadkey, Length: 640, dtype: int64
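
Each `quadkey` above has 16 digits, which in the Bing Maps tile system corresponds to zoom level 16. A quadkey appears up to four times, presumably once per quarter of 2022. The sketch below follows the published Bing Maps tile-system formulas to turn a longitude/latitude pair into a quadkey. The function names are illustrative and do not come from `povertymapping` or `geowrangler`.

```python
import math


def tile_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of tile_x and tile_y into a base-4 quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def lonlat_to_quadkey(lon, lat, zoom):
    """Return the quadkey of the Web Mercator tile containing (lon, lat)."""
    n = 1 << zoom  # number of tiles along each axis at this zoom
    # Clamp latitude to the Web Mercator limits.
    lat = max(min(lat, 85.05112878), -85.05112878)
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    tile_x = min(max(int(x * n), 0), n - 1)
    tile_y = min(max(int(y * n), 0), n - 1)
    return tile_to_quadkey(tile_x, tile_y, zoom)


# A point just inside the first tile listed by test_df.head().
print(lonlat_to_quadkey(121.9725, 20.45, 16))  # -> 1323011210311122
```

Because each digit encodes one zoom level, truncating a quadkey gives the key of its parent tile, which makes prefix matching a cheap way to group tiles at a coarser resolution.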

In [36]:
test_df.groupby('quadkey').agg({'avg_d_kbps': ['mean', 'sum']})

quadkey,avg_d_kbps (mean),avg_d_kbps (sum)
1323011210311122,30337.00,121348
1323012132130333,87810.75,351243
1323012132131302,26403.00,79209
1323012132313233,88988.00,355952
1323012132331101,33092.00,33092
...,...,...
1323302311002211,74252.75,297011
1323302320200202,61142.75,244571
1323302320200221,67206.75,268827
1323302320201300,80590.75,322363


In [53]:
test_df[['quadkey','tile']].drop_duplicates().reset_index(drop=True)

,quadkey,tile
0,1323011210311122,"POLYGON((121.97021484375 20.4527493625349, 121..."
1,1323012132130333,"POLYGON((120.536499023438 18.1928251977332, 12..."
2,1323012132131302,"POLYGON((120.56396484375 18.2032620195444, 120..."
3,1323012132313233,"POLYGON((120.558471679688 18.0675346872031, 12..."
4,1323012133023330,"POLYGON((120.662841796875 18.1562914028354, 12..."
...,...,...
635,1323211103313313,"POLYGON((123.041381835938 10.5904212882416, 12..."
636,1323211333300011,"POLYGON((123.590698242188 8.58102121564185, 12..."
637,1323213101310112,"POLYGON((122.991943359375 8.22780052615227, 12..."
638,1323300002220320,"POLYGON((123.77197265625 10.5418210946591, 123..."


## Test Feature Generation Over AOI

In [79]:
aoi = gdf.copy()
aoi = ookla.add_ookla_features(aoi, 'fixed', 2022, test_data_manager)
aoi = ookla.add_ookla_features(aoi, 'mobile', 2022, test_data_manager)
aoi.head()

2023-01-11 17:01:18.753 | DEBUG    | povertymapping.ookla:load_type_year_data:62 - Contents of data cache: ['3cf3e92f002ba03463994a5baac16f3c', '3e6ef48f4728534a77750bc09d3b24ab', 'b6aa128343ee480aca80ffb9389a3f81']
2023-01-11 17:01:18.755 | DEBUG    | povertymapping.ookla:load_type_year_data:64 - Ookla data for aoi, fixed 2022 (key: 3e6ef48f4728534a77750bc09d3b24ab) found in cache.
2023-01-11 17:01:19.769 | DEBUG    | povertymapping.ookla:load_type_year_data:62 - Contents of data cache: ['3cf3e92f002ba03463994a5baac16f3c', '3e6ef48f4728534a77750bc09d3b24ab', 'b6aa128343ee480aca80ffb9389a3f81']
2023-01-11 17:01:19.770 | DEBUG    | povertymapping.ookla:load_type_year_data:64 - Ookla data for aoi, mobile 2022 (key: b6aa128343ee480aca80ffb9389a3f81) found in cache.


,DHSCLUST,Wealth Index,DHSID,longitude,latitude,geometry,fixed_2022_mean_avg_d_kbps_mean,fixed_2022_mean_avg_u_kbps_mean,fixed_2022_mean_avg_lat_ms_mean,fixed_2022_mean_num_tests_mean,fixed_2022_mean_num_devices_mean,mobile_2022_mean_avg_d_kbps_mean,mobile_2022_mean_avg_u_kbps_mean,mobile_2022_mean_avg_lat_ms_mean,mobile_2022_mean_num_tests_mean,mobile_2022_mean_num_devices_mean
0,1,-31881.6087,PH201700000001,122.109807,6.674652,"POLYGON ((122.11981 6.67465, 122.11976 6.67367...",409.563869,114.480659,1.273992,0.083441,0.041512,1952.373452,614.62376,0.453908,0.017348,0.016809
1,2,-2855.375,PH201700000002,122.132027,6.662256,"POLYGON ((122.14203 6.66226, 122.14198 6.66128...",314.219563,132.421987,5.098473,1.666637,0.227078,1015.445032,312.259961,1.979489,1.147115,0.361082
2,3,-57647.04762,PH201700000003,122.179496,6.621822,"POLYGON ((122.18950 6.62182, 122.18945 6.62084...",759.218085,133.677421,2.412155,0.218157,0.054964,249.81895,126.967876,1.941354,0.628767,0.083579
3,4,-54952.66667,PH201700000004,122.137965,6.485298,"POLYGON ((122.14796 6.48530, 122.14792 6.48432...",49.031553,97.097382,12.924798,0.20404,0.078694,318.020993,358.79501,1.652021,0.11798,0.054017
5,6,-80701.69565,PH201700000006,121.916094,6.629457,"POLYGON ((121.92609 6.62946, 121.92605 6.62848...",,,,,,,,,,
