# Download survey GeoTiffs

This notebook uses a new, faster method for downloading DHS cluster images, based on [this blog post by Noel Gorelick](https://gorelick.medium.com/fast-er-downloads-a2abd512aa26).

Adapted from code provided by Markus Pettersson.

Import, authenticate, and initialize the Earth Engine library:

In [1]:
import ee

ee.Authenticate()

# Initialize the Google Earth Engine API with the high volume end-point.
# See https://developers.google.com/earth-engine/cloud/highvolume
ee.Initialize(opt_url='https://earthengine-highvolume.googleapis.com')

Successfully saved authorization token.


In [2]:
# Import other libraries
import pandas as pd
import os
import satellite_sampling_annual_v3
import datetime

Read the CSV file with the survey points:

In [3]:
interim_data_dir = '/mimer/NOBACKUP/groups/globalpoverty1/cindy/eoml_ch_wb/data/interim'
dhs_cluster_file_path = os.path.join(interim_data_dir, 'dhs_est_iwi.csv')
df = pd.read_csv(dhs_cluster_file_path)
df.head()

Unnamed: 0,country,survey_start_year,year,lat,lon,households,rural,iwi,dhs_id,image_file,...,iwi_1996_1998_est,iwi_1999_2001_est,iwi_2002_2004_est,iwi_2005_2007_est,iwi_2008_2010_est,iwi_2011_2013_est,iwi_2014_2016_est,iwi_2017_2019_est,image_file_annual,image_file_5k_3yr
0,south_africa,2016,2016,-34.463232,19.542468,6,1,70.723295,48830,./data/dhs_tifs/south_africa_2016/00743.tif,...,33.911133,43.969727,38.295898,33.579102,32.757568,38.330078,44.604492,49.267578,./data/dhs_tifs_annual/south_africa_2016/00000...,./data/dhs_tifs_5k_3yr/south_africa_2016/00000...
1,south_africa,2016,2016,-34.418873,19.188926,11,0,76.798705,48781,./data/dhs_tifs/south_africa_2016/00694.tif,...,56.29883,59.228516,60.98633,63.51563,66.223145,66.45508,66.137695,64.501953,./data/dhs_tifs_annual/south_africa_2016/00001...,./data/dhs_tifs_5k_3yr/south_africa_2016/00001...
2,south_africa,2016,2016,-34.412835,19.178965,4,0,81.053723,48828,./data/dhs_tifs/south_africa_2016/00741.tif,...,54.44336,58.71582,60.419923,63.03711,66.430664,65.934247,66.186523,64.25781,./data/dhs_tifs_annual/south_africa_2016/00002...,./data/dhs_tifs_5k_3yr/south_africa_2016/00002...
3,south_africa,2016,2016,-34.292107,19.563813,6,1,72.76688,48787,./data/dhs_tifs/south_africa_2016/00700.tif,...,20.300293,25.082397,27.207032,27.719727,26.94702,34.114584,36.865234,42.041016,./data/dhs_tifs_annual/south_africa_2016/00003...,./data/dhs_tifs_5k_3yr/south_africa_2016/00003...
4,south_africa,2016,2016,-34.1875,22.113079,3,0,77.864113,48756,./data/dhs_tifs/south_africa_2016/00669.tif,...,49.617514,48.321533,53.23242,56.865233,65.01465,65.65755,72.90039,67.529297,./data/dhs_tifs_annual/south_africa_2016/00004...,./data/dhs_tifs_5k_3yr/south_africa_2016/00004...


Split the dataframe into each country-year combination:

In [4]:
surveys_with_dfs = [(survey, survey_df.reset_index(drop=True)) for survey, survey_df in 
                    df.groupby(['country', 'year'])]
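As a minimal sketch of what this split produces, the toy frame below (invented rows, not DHS data) is grouped the same way. Grouping by a list of two columns yields `(country, year)` tuples as keys, and `reset_index(drop=True)` renumbers each survey's rows from 0. That renumbering matters because the download helper further down builds file names from `row.name`. The names `toy_df` and `toy_surveys` are hypothetical.

```python
import pandas as pd

# Toy data for illustration only; the coordinates are invented.
toy_df = pd.DataFrame(
    {
        "country": ["benin", "angola", "benin", "angola", "benin"],
        "year": [1996, 2006, 1996, 2006, 2001],
        "lat": [6.4, -8.8, 6.5, -8.9, 9.3],
        "lon": [2.4, 13.2, 2.6, 13.3, 2.3],
    }
)

# Same pattern as the cell above: one (key, frame) pair per country-year.
toy_surveys = [
    (survey, survey_df.reset_index(drop=True))
    for survey, survey_df in toy_df.groupby(["country", "year"])
]

for (country, year), survey_df in toy_surveys:
    # After reset_index, each survey's rows are numbered 0, 1, 2, ...
    file_names = [f"{i:05d}.tif" for i in survey_df.index]
    print(country, year, file_names)
```

Without the `reset_index` call, each survey would keep its row labels from the full frame, and the per-survey file numbering would no longer start at `00000.tif`.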

Define a helper that checks whether a sample has already been downloaded, so that a restarted run can skip finished files:

In [7]:
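def check_if_downloaded(row, save_dir, min_file_size=1000000):
    """Return True if this row's GeoTIFF exists in save_dir and exceeds min_file_size bytes."""
    # row.name is the row's index label, so files are named 00000.tif, 00001.tif, ...
    file_name = f'{row.name:05d}.tif'
    file_path = os.path.join(save_dir, file_name)

    # Files at or below min_file_size count as not downloaded and are fetched again
    return os.path.isfile(file_path) and (os.stat(file_path).st_size > min_file_size)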
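To illustrate how the size threshold behaves, here is a small self-contained sketch. It repeats the helper so that it runs on its own, writes dummy files into a temporary directory, and uses `types.SimpleNamespace` as a stand-in for a DataFrame row (only `.name` is read). The idea that a tiny file is what an interrupted download leaves behind is an assumption for the sake of the example.

```python
import os
import tempfile
from types import SimpleNamespace


def check_if_downloaded(row, save_dir, min_file_size=1000000):
    # Same logic as the helper above, repeated so this sketch is runnable alone.
    file_path = os.path.join(save_dir, f"{row.name:05d}.tif")
    return os.path.isfile(file_path) and os.stat(file_path).st_size > min_file_size


with tempfile.TemporaryDirectory() as save_dir:
    # Row 0: a dummy file just over the 1,000,000-byte threshold.
    with open(os.path.join(save_dir, "00000.tif"), "wb") as f:
        f.write(b"\0" * 1_000_001)
    # Row 1: a tiny file, standing in for a truncated download (assumption).
    with open(os.path.join(save_dir, "00001.tif"), "wb") as f:
        f.write(b"\0" * 10)
    # Row 2: no file at all.

    results = [
        check_if_downloaded(SimpleNamespace(name=i), save_dir) for i in range(3)
    ]

print(results)  # [True, False, False]
```

Only row 0 counts as downloaded; rows 1 and 2 would both be queued again on a restart.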

Download each survey from Google Earth Engine, skipping samples that are already on disk:

In [None]:
for survey, survey_df in surveys_with_dfs:
    country, year = survey
    print(f'Downloading images for {country}-{year}...'+
        datetime.datetime.now().strftime("%d.%b %Y %H:%M:%S"))
       
    data_dir = '/mimer/NOBACKUP/groups/globalpoverty1/cindy/eoml_ch_wb/data/'    
    save_dir = os.path.join(data_dir, f'dhs_tifs_annual_v3/{country}_{year}')        
           
    # Check if survey is already fully/partially downloaded
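    if os.path.exists(save_dir):
        is_downloaded = survey_df.apply(lambda row: check_if_downloaded(row, save_dir), axis=1)
        # Copy the selection so that export_images can add columns to it
        # without triggering pandas' SettingWithCopyWarning
        samples_to_download = survey_df[~is_downloaded].copy()
    else:
        os.makedirs(save_dir)
        samples_to_download = survey_df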
    
    # If there are no samples to download for this survey, continue to next one
    if len(samples_to_download) == 0:
        continue
    
    satellite_sampling_annual_v3.export_images(samples_to_download, save_dir, span_length=1)

Downloading images for angola-2006...19.Oct 2023 03:31:47
Downloading images for benin-1996...19.Oct 2023 03:31:47
Downloading images for burkina_faso-1999...19.Oct 2023 03:31:47


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  survey_df['geometry'] = survey_df.apply(lambda row: ee.Geometry.Point([row['lon'], row['lat']]), axis=1)


Downloading images for burundi-2010...19.Oct 2023 03:32:11
Downloading images for cameroon-2004...19.Oct 2023 03:32:11


Downloading images for central_african_republic-1995...19.Oct 2023 03:33:06
Downloading images for chad-2014...19.Oct 2023 03:33:06


Downloading images for comoros-2012...19.Oct 2023 03:40:40
Downloading images for democratic_republic_of_congo-2007...19.Oct 2023 03:40:40
Downloading images for egypt-1996...19.Oct 2023 03:40:40
Downloading images for eswatini-2006...19.Oct 2023 03:40:40
Downloading images for ethiopia-2000...19.Oct 2023 03:40:40


Downloading images for gabon-2012...19.Oct 2023 03:41:27
Downloading images for ghana-1999...19.Oct 2023 03:41:28
Downloading images for guinea-1999...19.Oct 2023 03:41:28
Downloading images for ivory_coast-1999...19.Oct 2023 03:41:28
Downloading images for kenya-2003...19.Oct 2023 03:41:28
Downloading images for lesotho-2004...19.Oct 2023 03:41:28
Downloading images for liberia-2008...19.Oct 2023 03:41:28
Downloading images for madagascar-1997...19.Oct 2023 03:41:28
Downloading images for malawi-2000...19.Oct 2023 03:41:28
Downloading images for mali-1996...19.Oct 2023 03:41:28


Downloading images for morocco-2003...19.Oct 2023 03:42:32
Downloading images for mozambique-2011...19.Oct 2023 03:42:32
Downloading images for namibia-2000...19.Oct 2023 03:42:32


Downloading images for niger-1998...19.Oct 2023 03:44:37


Downloading images for nigeria-2003...19.Oct 2023 03:48:03


Downloading images for rwanda-2005...19.Oct 2023 03:49:10
Downloading images for senegal-1997...19.Oct 2023 03:49:10


Downloading images for sierra_leone-2008...19.Oct 2023 03:49:31
Downloading images for south_africa-2016...19.Oct 2023 03:49:31


Downloading images for tanzania-1999...19.Oct 2023 03:50:26
Downloading images for togo-1998...19.Oct 2023 03:50:26
Downloading images for uganda-2000...19.Oct 2023 03:50:26


Downloading images for zambia-2007...19.Oct 2023 03:51:37
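
Since a full run takes a while, it can be useful to see how much work remains before restarting it. The sketch below is one possible dry run of the skip logic in the download loop: it walks the same `dhs_tifs_annual_v3/{country}_{year}` layout and reports which row indices still lack a large enough file, without contacting Earth Engine. The functions `pending_indices` and `plan_downloads`, the `survey_sizes` mapping, and the row counts are all hypothetical and only for illustration.

```python
import os
import tempfile


def pending_indices(n_rows, save_dir, min_file_size=1000000):
    """Return row indices whose .tif is missing or not larger than min_file_size."""
    pending = []
    for i in range(n_rows):
        file_path = os.path.join(save_dir, f"{i:05d}.tif")
        done = os.path.isfile(file_path) and os.stat(file_path).st_size > min_file_size
        if not done:
            pending.append(i)
    return pending


def plan_downloads(survey_sizes, data_dir, min_file_size=1000000):
    """Map each (country, year) to the row indices that still need downloading."""
    plan = {}
    for (country, year), n_rows in survey_sizes.items():
        save_dir = os.path.join(data_dir, f"dhs_tifs_annual_v3/{country}_{year}")
        pending = pending_indices(n_rows, save_dir, min_file_size)
        if pending:  # complete surveys are left out, like `continue` in the loop
            plan[(country, year)] = pending
    return plan


# Demo in a temporary directory, with a tiny threshold so the files stay small.
with tempfile.TemporaryDirectory() as data_dir:
    done_dir = os.path.join(data_dir, "dhs_tifs_annual_v3", "angola_2006")
    os.makedirs(done_dir)
    for i in range(2):
        with open(os.path.join(done_dir, f"{i:05d}.tif"), "wb") as f:
            f.write(b"\0" * 100)

    partial_dir = os.path.join(data_dir, "dhs_tifs_annual_v3", "burkina_faso_1999")
    os.makedirs(partial_dir)
    with open(os.path.join(partial_dir, "00000.tif"), "wb") as f:
        f.write(b"\0" * 100)

    # Invented row counts per survey; chad_2014 has no directory yet.
    survey_sizes = {("angola", 2006): 2, ("burkina_faso", 1999): 3, ("chad", 2014): 2}
    plan = plan_downloads(survey_sizes, data_dir, min_file_size=50)

print(plan)  # {('burkina_faso', 1999): [1, 2], ('chad', 2014): [0, 1]}
```

In the notebook itself, the row counts could be taken from `len(survey_df)` for each entry of `surveys_with_dfs`.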
