# Introduction

Handy examples showing how to automatically download files from the web and from Google Drive. The examples were chosen for the Save the Children stream of the DataKind 2022 DataDive.

- [project brief](https://docs.google.com/document/d/1TQ2TiGK_k8KEIUPzVb3ZSKaxzEBP9s9ZJBI2U_HeQ6U/edit#)
- [Google drive with data](https://drive.google.com/drive/folders/1G_CAhpb0xV9zRrV-T5ngA03x-IZ4tGKz)

## Setup

In [12]:
# If you don't have these installed, uncomment these lines and run
#%pip install gdown
#%pip install chardet

Collecting chardet
  Downloading chardet-5.1.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.1/199.1 kB 3.3 MB/s eta 0:00:00
Installing collected packages: chardet
Successfully installed chardet-5.1.0
Note: you may need to restart the kernel to use updated packages.
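
Before uncommenting the install lines, it can help to check which packages are actually missing. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the `required` list simply mirrors the imports used in this notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Third-party packages imported by this notebook
required = ["gdown", "pandas", "chardet"]
missing = missing_packages(required)
if missing:
    print("Install these first: %pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

Reading `.xlsx` files with pandas usually also needs an Excel engine such as `openpyxl`; add it to `required` if your environment does not already have it.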


In [13]:
import gdown
import os
import pandas as pd
import chardet

## Examples
### Downloading a single file from Google Drive

In [5]:
url_demo_data = "https://docs.google.com/spreadsheets/d/1i41eQRgQatGbchbq2dcL6r9AVyndTSrl/edit?usp=share_link&ouid=106808949113099347741&rtpof=true&sd=true"
filename_demo_data = "kenya-3w-final-list-2017.xlsx"
gdown.download(url_demo_data, filename_demo_data, quiet=False)

Downloading...
From: https://docs.google.com/spreadsheets/d/1i41eQRgQatGbchbq2dcL6r9AVyndTSrl/edit?usp=share_link&ouid=106808949113099347741&rtpof=true&sd=true
To: /Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/kenya-3w-final-list-2017.xlsx
295kB [00:00, 33.8MB/s]


'kenya-3w-final-list-2017.xlsx'
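
Share links like the one above carry the file ID in the path segment after `/d/`. If a share link does not download cleanly, one option is to pull the ID out and build the `uc?id=` form that gdown's own examples use. This is a rough sketch: the helper `drive_file_id` is hypothetical, and the direct URL may not work for native Google Sheets documents (as opposed to uploaded Excel files).

```python
import re

def drive_file_id(url):
    """Return the ID that follows '/d/' in a Drive or Docs share link, or None."""
    match = re.search(r"/d/([A-Za-z0-9_-]+)", url)
    return match.group(1) if match else None

share_link = (
    "https://docs.google.com/spreadsheets/d/1i41eQRgQatGbchbq2dcL6r9AVyndTSrl/"
    "edit?usp=share_link&ouid=106808949113099347741&rtpof=true&sd=true"
)
file_id = drive_file_id(share_link)

# Direct-download style URL (assumed to be accepted by gdown.download)
direct_url = f"https://drive.google.com/uc?id={file_id}"
print(direct_url)
```

Recent gdown versions also accept `fuzzy=True` in `gdown.download`, which is meant to do this extraction for you; check the version you have installed.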

### Downloading a folder from Google Drive

Beware: gdown downloads at most 50 files per folder.
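
Because of that cap, it is worth checking whether a download came back with exactly the maximum number of files, which hints that the folder holds more than was fetched. A minimal sketch follows; the helper `hit_folder_limit` and the example list are hypothetical, and the limit of 50 is taken from the note above rather than from any particular gdown version.

```python
FOLDER_FILE_LIMIT = 50  # cap mentioned above; assumed, may differ between gdown versions

def hit_folder_limit(downloaded_paths, limit=FOLDER_FILE_LIMIT):
    """Return True when the number of downloaded files reached the limit,
    which suggests the folder may contain more files than were fetched."""
    return len(downloaded_paths) >= limit

# Made-up list standing in for the paths returned by gdown.download_folder
example_paths = [f"Kenya/file_{i}.csv" for i in range(50)]
if hit_folder_limit(example_paths):
    print("Reached the 50-file limit; some files may be missing.")
```

In practice you would pass the list returned by `gdown.download_folder` instead of `example_paths`. Some gdown versions stop with a message for larger folders unless `remaining_ok=True` is passed; treat that as something to verify for your install.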

In [9]:
folder_url = "https://drive.google.com/drive/folders/1G_CAhpb0xV9zRrV-T5ngA03x-IZ4tGKz?usp=share_link"
gdown.download_folder(folder_url, quiet=True, use_cookies=False)


['/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/dbo-foodconsumptionscores.csv',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/fts_incoming_funding_ken.csv',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/fts_requirements_funding_cluster_ken.csv',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/fts_requirements_funding_covid_ken.csv',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/fts_requirements_funding_globalcluster_ken.csv',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/fts_requirements_funding_ken.csv',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/health_ken.csv',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/hotosm_ken_roads_lines_shp.zip',
 '/Users/matthewharris/Desktop/git/STC_DataDive_Dec22/notebooks/dividor/Kenya/KE_202207.zip',
 

## Reading files in a directory to pandas

This snippet reads every CSV and XLSX file in a directory into pandas dataframes and collects them in a dictionary keyed by file path (plus the sheet name for Excel files).

It is only an example: it needs more logic to handle xlsx sheets that hold multiple tables, blank rows, and so on.

In [33]:
dir = "./Kenya"

# Dictionary of pandas dataframes, keyed by file path (plus sheet name for Excel files)
dfs = {}

for f in os.listdir(dir):
    f = f"{dir}/{f}"
    if f.endswith('.csv'):
        print(f"Loading csv file {f}")
        # Detect the encoding from the first 100,000 bytes
        with open(f, 'rb') as rawdata:
            r = chardet.detect(rawdata.read(100000))
        dfs[f] = pd.read_csv(f, encoding=r['encoding'])
    elif f.endswith('.xlsx'):
        print(f"Loading {f} ...")
        # Read all sheets in; sheet_name=None returns a dict of {sheet name: dataframe}
        sheet_to_df_map = pd.read_excel(f, sheet_name=None)
        for sheet in sheet_to_df_map:
            dfs[f"{f} - {sheet}"] = sheet_to_df_map[sheet]


Loading csv file ./Kenya/fts_requirements_funding_covid_ken.csv
Loading csv file ./Kenya/dbo-foodconsumptionscores.csv
Loading ./Kenya/per-capita-food-consumption-2017-and-2018.xlsx ...
Loading ./Kenya/retail-prices-for-dry-beans-2019-per-kg.xlsx ...
Loading ./Kenya/value-of-recorded-marketed-agricultural-production-at-current-prices-2014-2018.xlsx ...
Loading csv file ./Kenya/wfp_food_prices_ken.csv
Loading csv file ./Kenya/fts_requirements_funding_cluster_ken.csv
Loading csv file ./Kenya/fts_incoming_funding_ken.csv
Loading csv file ./Kenya/health_ken.csv
Loading ./Kenya/ken_adminboundaries_tabulardata.xlsx ...
Loading csv file ./Kenya/fts_requirements_funding_ken.csv
Loading ./Kenya/ken_admpop_2019.xlsx ...
Loading csv file ./Kenya/ken_admpop_adm1_2019.csv
Loading csv file ./Kenya/ken_admpop_adm0_2019.csv
Loading ./Kenya/Kenya - IPC Analysis 2017-2022.xlsx ...
Loading ./Kenya/retail-prices-for-dry-beans-2018-per-kg.xlsx ...
Loading csv file ./Kenya/fts_requirements_funding_globalclu
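
The loader above leaves blank rows and columns in place. As one possible first cleanup step, the sketch below drops rows and columns that are entirely empty. The helper name `drop_blank_rows_and_columns` is hypothetical and the small frame is made-up data, not taken from the Kenya files; sheets that contain several separate tables would still need extra handling.

```python
import pandas as pd

def drop_blank_rows_and_columns(df):
    """Remove rows and columns in which every cell is missing."""
    return df.dropna(how="all").dropna(axis=1, how="all").reset_index(drop=True)

# Made-up frame imitating a sheet with one empty row and one empty column
raw = pd.DataFrame({
    "county": ["Nairobi", None, "Mombasa"],
    "unused": [None, None, None],
    "price": [120.0, None, 95.5],
})
clean = drop_blank_rows_and_columns(raw)
print(clean.shape)
```

To apply it to everything that was loaded, something like `dfs = {key: drop_blank_rows_and_columns(df) for key, df in dfs.items()}` should work.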



In [34]:
print(f"\n\nHere are the shapes of the dataframes we just loaded from directory {dir} ... \n\n")
for key in dfs:
    df = dfs[key]
    print(f"{key} >> {df.shape}")

print("Done")



Here are the shapes of the dataframes we just loaded from directory ./Kenya ... 


./Kenya/fts_requirements_funding_covid_ken.csv >> (3, 13)
./Kenya/dbo-foodconsumptionscores.csv >> (885, 15)
./Kenya/per-capita-food-consumption-2017-and-2018.xlsx - Sheet1 >> (21, 4)
./Kenya/retail-prices-for-dry-beans-2019-per-kg.xlsx - Prices for Dry Beans, 2019 >> (101, 4)
./Kenya/value-of-recorded-marketed-agricultural-production-at-current-prices-2014-2018.xlsx - Sheet1 >> (32, 8)
./Kenya/wfp_food_prices_ken.csv >> (14451, 14)
./Kenya/fts_requirements_funding_cluster_ken.csv >> (140, 12)
./Kenya/fts_incoming_funding_ken.csv >> (183, 37)
./Kenya/health_ken.csv >> (9412, 6)
./Kenya/ken_adminboundaries_tabulardata.xlsx - Admin2 >> (290, 16)
./Kenya/ken_adminboundaries_tabulardata.xlsx - Admin1 >> (47, 14)
./Kenya/ken_adminboundaries_tabulardata.xlsx - Admin0 >> (1, 12)
./Kenya/fts_requirements_funding_ken.csv >> (47, 12)
./Kenya/ken_admpop_2019.xlsx - ken_admpop_ADM0_2019 >> (1, 71)
./Kenya/ken_admp