# Health innovation funding landscape exploration: download the data

This notebook downloads data from 360Giving about the health funding landscape. It seeks to provide some empirical context for the RWJF scoping phase.

# Preamble

In [1]:
%matplotlib inline

#Imports used throughout the notebook
import datetime
import io
import os
import pickle

import pandas as pd
import requests

#NB I open a standard set of directories

#Paths

#Get the top path
top_path = os.path.dirname(os.getcwd())

#Create the path for external data
ext_data = os.path.join(top_path,'data/external')

#Raw path (for html downloads)
raw_data = os.path.join(top_path,'data/raw')

#And processed data
proc_data = os.path.join(top_path,'data/processed')

#Path for figures
fig_path = os.path.join(top_path,'reports/figures')

#Get date for saving files
today = datetime.datetime.today()

today_str = "_".join([str(x) for x in [today.day,today.month,today.year]])

In [2]:
#Additional imports

import ratelim


# 1. Data collection

[360Giving](http://www.threesixtygiving.org/data/data-registry/) is a standard for open data about charitable giving in the UK. The roughly 70 organisations participating in the programme publish their data in a standardised way. The hope is that this open dataset will improve our understanding of the funding landscape in the UK, as well as its impacts and gaps.

A JSON file with metadata about each dataset, including a download link, is available from [this page](http://threesixtygiving.github.io/getdata/). We will loop through its entries, download each file, concatenate the results and start an exploratory analysis.

A preliminary glance at the data suggests some lack of standardisation (for example, some funders include the country receiving the grant in the title while others have a dedicated field for it). Some fields that appear to be present in all cases are, unsurprisingly:

* project name, 
* description, 
* recipient, 
* period and 
* timeline.  

We can use them to start addressing some of the questions above.

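One way to check which fields really are shared is to intersect the column names of every table. The sketch below is illustrative only: `common_fields` is a hypothetical helper, and the column names in the example are made up for the demonstration rather than taken from the downloaded files.

```python
def common_fields(column_lists):
    """Return the set of column names that appear in every list of columns."""
    sets = [set(cols) for cols in column_lists]
    # With no tables at all there is nothing in common
    return set.intersection(*sets) if sets else set()


# Hypothetical column names for three funders (not real data)
example_columns = [
    ['Title', 'Description', 'Recipient', 'Country'],
    ['Title', 'Description', 'Recipient', 'Award Date'],
    ['Title', 'Description', 'Recipient'],
]

shared = common_fields(example_columns)
print(sorted(shared))
```

With a list of DataFrames, the same helper could be called as `common_fields([t.columns for t in tables])`.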


In [3]:
#Let's get started

#First we will acquire the json file with the metadata using the following link

url = 'http://data.threesixtygiving.org/data.json'

#We use the get method to download the data and to parse into a json object
three60_metadata = requests.get(url).json()

Extract relevant information about each element in the json:

* Title
* Organisation name
* Organisation link
* Download url
* Coverage
* License

In [4]:
#We create a flat dictionary from the json file above and then create a df with it
#This is quite a generic problem. Maybe I should write a tool to do this. 

flat_dict = [{'title':x['title'],
              'org':x['publisher']['name'],
              'org_url':x['publisher']['website'],
              'file_url':x['distribution'][0]['downloadURL'],
              'license':x['license'],
              'modified':x['modified']} for x in three60_metadata]

three60_df = pd.DataFrame.from_dict(flat_dict,orient='columns')

#This is what it looks like
three60_df.head()

| | file_url | license | modified | org | org_url | title |
|---|---|---|---|---|---|---|
| 0 | https://www.arcadiafund.org.uk/wp-content/uplo... | https://creativecommons.org/licenses/by/4.0/ | 2017-09-01T07:29:04+0000 | Arcadia Fund | https://www.arcadiafund.org.uk/ | Arcadia Fund grants awarded to July 2017 |
| 1 | https://www.barrowcadbury.org.uk/wp-content/up... | https://creativecommons.org/licenses/by/4.0/ | 2018-02-05T14:27:52+0000 | Barrow Cadbury Trust | http://www.barrowcadbury.org.uk/ | Grants awarded 2012 to December 2017 |
| 2 | http://downloads.bbc.co.uk/tv/pudsey/360_givin... | https://creativecommons.org/licenses/by/4.0/ | 2017-01-29T17:37:46+0000 | BBC Children in Need | http://www.bbc.co.uk/corporate2/childreninneed | BBC Children in Need grants |
| 3 | https://www.biglotteryfund.org.uk/-/media/File... | http://www.nationalarchives.gov.uk/doc/open-go... | 2017-07-27T10:33:57+0000 | Big Lottery Fund | https://www.biglotteryfund.org.uk/ | Big Lottery Fund - grants data 2015 to 2017 |
| 4 | https://www.biglotteryfund.org.uk/-/media/File... | http://www.nationalarchives.gov.uk/doc/open-go... | 2018-02-19T11:53:07+0000 | Big Lottery Fund | https://www.biglotteryfund.org.uk/ | Big Lottery Fund - grants data 2017-18 year-to... |

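The comment in the cell above notes that flattening nested JSON is a generic problem. A minimal sketch of such a tool follows; `flatten` is a hypothetical helper, not part of the original notebook, and the sample record only mimics the shape of the metadata with made-up values. `pandas.json_normalize` covers similar ground for nested dictionaries, though it handles lists differently.

```python
def flatten(record, parent_key='', sep='_'):
    """Flatten nested dicts and lists into a single-level dict.

    Nested keys are joined with `sep`; list items are keyed by position.
    """
    if isinstance(record, dict):
        children = record.items()
    elif isinstance(record, list):
        children = enumerate(record)
    else:
        # A scalar: nothing left to unpack
        return {parent_key: record}

    items = {}
    for key, value in children:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        items.update(flatten(value, new_key, sep))
    return items


# Made-up record with the same nesting as the metadata entries
sample = {
    'title': 'Example grants',
    'publisher': {'name': 'Example Trust', 'website': 'http://example.org'},
    'distribution': [{'downloadURL': 'http://example.org/grants.csv'}],
}

flat = flatten(sample)
print(flat)
```

A list of such flattened records can be passed straight to `pd.DataFrame`, at the cost of less readable column names than the hand-written dictionary above.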

There are several organisations with more than one file. This seems to reflect different funding periods.

In [57]:
three60_df.org.value_counts()[:10]

Oxfordshire Community Foundation                5
The Wolfson Foundation                          3
Scottish Council for Voluntary Organisations    3
Tudor Trust                                     3
LandAid Charitable Trust                        3
Big Lottery Fund                                3
Lankelly Chase Foundation                       3
Pears Foundation                                2
Trafford Housing Trust                          2
Joseph Rowntree Foundation                      2
Name: org, dtype: int64

In [65]:
#Who is in the data?
print("\n".join([x for x in sorted(set(three60_df.org))]))


Arcadia Fund
BBC Children in Need
Barrow Cadbury Trust
Big Lottery Fund
Birmingham City Council
Blagrave Trust
Cabinet Office
Calouste Gulbenkian Foundation (UK Branch)
Cheshire Community Foundation
City Bridge Trust
Co-operative Group
Comic Relief
Community Foundation Tyne & Wear and Northumberland
Community Foundation for Surrey
Dunhill Medical Trust
Equity Foundation
Esmee Fairbairn Foundation
Essex Community Foundation
Gatsby Charitable Foundation
Greenham Common Trust
Henry Smith Charity
Indigo Trust
Joseph Rowntree Charitable Trust
Joseph Rowntree Foundation
LandAid Charitable Trust
Lankelly Chase Foundation
Lloyd's Register Foundation
Lloyds Bank Foundation
London Borough of Barnet
London Catalyst
London Councils
Macc
Millfield House Foundation
Nationwide Foundation
Nesta
Northern Rock Foundation
One Manchester
Oxford City Council
Oxfordshire Community Foundation
Paul Hamlyn Foundation
Pears Foundation
Power to Change
Quartet Community Foundation
Quixote Foundation
R S Macdonald

In [66]:
#How many organisations?
#71 organisations!
len(set(three60_df.org))

71

In [234]:
def get_file_type_string(request):
    '''
    This function takes the response object returned by requests.get and returns a string where we look for the file type
    
    '''
    
    #Extract the url. This will often contain the file extension
    text = request.url
    
    #Also add metadata from the get, in case there was no file extension:
    if 'Content-Disposition' in request.headers:
        text = text + ' '+request.headers['Content-Disposition']
        
    return(text)
        


@ratelim.patient(5,10)
def get_360_file(url):
    '''
    This function downloads each file in the 360Giving data. We mostly create it to decorate with the rate limiter,
    which slows down the pace at which we download files from 360Giving.
    
    '''
    
    #Different data sources have different formats so we have to work with that as well.
    
    #Get the file
    request = requests.get(url)
    
    #If the status code is 200, parse etc.
    if request.status_code==200:
        
        #The parsing depends on the type of file. We get the type of file from the header or the url name
        file_type_string = get_file_type_string(request)
        
        #This takes ages with large files.
        if '.csv' in file_type_string:
            #We need to wrap the text in a file-like object for read_csv
            table = pd.read_csv(io.StringIO(request.text))
        
        elif '.xls' in file_type_string:
            #Excel is a bit different: it needs the raw bytes (sheet_name replaces the deprecated sheetname)
            with io.BytesIO(request.content) as fh:
                table = pd.read_excel(fh, sheet_name=0)

        elif '.json' in file_type_string:
            #There is even one download with json!
            table = pd.DataFrame.from_dict(request.json()['grants'])

        else:
            #Unrecognised file type: raise a clear error instead of returning an undefined variable
            raise ValueError('Unrecognised file type: '+file_type_string)

        return(table)
    
    else:
        #Return the status code so we can check it later
        return(request.status_code)
    
    

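The substring checks above would also match a name such as `notes.csv.html`. A stricter alternative, sketched here with only the standard library, parses the extension from the url path and falls back to the `Content-Disposition` filename. `guess_file_type` is a hypothetical helper and the urls in the example are made up.

```python
import os
import re
from urllib.parse import urlparse

# Map file extensions to the parser we would use
EXTENSION_TYPES = {'.csv': 'csv', '.xls': 'excel', '.xlsx': 'excel', '.json': 'json'}


def guess_file_type(url, content_disposition=''):
    """Guess the file type from the url path, then from the header filename."""
    # The query string is ignored: only the path carries the extension
    ext = os.path.splitext(urlparse(url).path)[1].lower()

    if not ext and content_disposition:
        # e.g. 'attachment; filename="grants.csv"'
        match = re.search(r'filename="?([^";]+)', content_disposition,
                          flags=re.IGNORECASE)
        if match:
            ext = os.path.splitext(match.group(1))[1].lower()

    # None signals a type we do not know how to parse
    return EXTENSION_TYPES.get(ext)


print(guess_file_type('https://example.org/data/grants.xlsx'))
print(guess_file_type('https://example.org/download?id=3',
                      'attachment; filename="grants.csv"'))
print(guess_file_type('https://example.org/notes.csv.html'))
```

Inside `get_360_file`, the helper could be called as `guess_file_type(request.url, request.headers.get('Content-Disposition', ''))`.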
In [289]:
#This loops over the urls we have and puts them in a container. When this doesn't work,
#it returns an error we can check later.

t60_container = []

for url in three60_df['file_url']:
    print(url)
    try:
        file = get_360_file(url)
        t60_container.append(file)
    except Exception:
        t60_container.append('error')

https://www.arcadiafund.org.uk/wp-content/uploads/2017/07/Arcadia-grants-360Giving-28-Jul-2017.xlsx


https://www.barrowcadbury.org.uk/wp-content/uploads/2018/02/Copy-of-2017-12-360-Giving-until-2017-12-revised.xlsx
http://downloads.bbc.co.uk/tv/pudsey/360_giving_data_02102016.xlsx
https://www.biglotteryfund.org.uk/-/media/Files/Research%20Documents/aOpenDataFiles/open_data_2015_2017.xlsx
https://www.biglotteryfund.org.uk/-/media/Files/Research%20Documents/aOpenDataFiles/BLFOpenData17-18.xlsx
https://www.biglotteryfund.org.uk/-/media/Files/Research%20Documents/aOpenDataFiles/BLFOpenData_2004_2015_V41.csv
https://data.birmingham.gov.uk/dataset/bb896f0b-10d7-403d-bad4-cc147349c380/resource/6ff023e2-947a-4eb9-bd67-0cdd2c7163dc/download/ssystemsgovernancetransparencygrants360-giving-bcc-data_2014-17-v2.xlsx
https://www.blagravetrust.org/wp-content/uploads/2018/02/360G-blagravetrust-2017.xlsx
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/663589/GGIS_Grant_Awards_2016_to_2017_2017-10-27_1621.xlsx
https://s3-eu-central-1.amazonaws.com/content.gulbenkian.pt/wp-conte

In [297]:
#How many errors

errors = [(y,x) for x,y in zip(three60_df['file_url'],t60_container) if not isinstance(y,pd.DataFrame)]

errors

[('error',
  'http://www.equityfoundation.co.uk/wp-content/uploads/2016/11/360-Upload-Jan-16-July-2017-1.xlsx'),
 (403,
  'https://3p50ut4bws5s2uzhmycc4t21-wpengine.netdna-ssl.com/wp-content/uploads/2018/03/OCF-Grants_FY17-18_21mar.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2017/04/OCF-Grants_FY13-14-mod.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2017/10/OCF-Grants_FY15-16-mod.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2016/12/OCF-Grants_FY14-15-mod.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2017/10/OCF-Grants_FY16-17-final.xlsx'),
 (403,
  'http://www.powertochange.org.uk/wp-content/uploads/2017/07/Power-to-Change-Grants-Data-2015-2016.xlsx'),
 (403,
  'http://tudortrust.org.uk/assets/file/Fixed_2017-04-28_Tudor_Trust_grants_01-04-16_-_31-03-17.xlsx'),
 (403,
  'http://tudortrust.org.uk/assets/file/2018-01-04_grants_01-04-13_to_31-03-16_FNL_revised.xlsx'),
 (403,
  'http://tudortrust.org.uk/assets/file/2018-01-02_data

All but one of the failures are 403 (Forbidden) responses. The exception is the Equity Foundation file, where there seems to be a problem with the formatting. The 403 files can be downloaded manually, or with urllib, and then loaded, but I'm going to leave that for now.

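A possible starting point for the urllib route is sketched below. It assumes, without having verified it against these servers, that the 403 responses come from hosts rejecting requests that lack a browser-like `User-Agent`. `build_request`, `fetch_bytes` and the `BROWSER_UA` string are hypothetical names introduced for illustration.

```python
import urllib.request

# Assumption: an explicit User-Agent; the exact string is illustrative
BROWSER_UA = 'Mozilla/5.0 (compatible; funding-landscape-notebook)'


def build_request(url, user_agent=BROWSER_UA):
    """Build a urllib request that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_bytes(url, timeout=30):
    """Download a url and return its raw bytes.

    urllib raises HTTPError if the server still answers with 403.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# No network call happens here: we only inspect the request we would send
example = build_request('http://example.org/grants.xlsx')
print(example.get_header('User-agent'))
```

The bytes returned by `fetch_bytes` could then be wrapped in `io.BytesIO` and passed to `pd.read_excel`, as in `get_360_file`.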
TODO: Fix errors above

In [360]:
#Pickle the file 

with open(ext_data+'/{date}_file_download.p'.format(date=today_str),'wb') as outfile:
    pickle.dump(t60_container,outfile)
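
The concatenation step mentioned in the data collection section could look roughly like the sketch below. It keeps only the entries that are DataFrames, tags each with its funder and stacks them. `combine_downloads` and the `funder` column are hypothetical names, and the tiny tables are made-up stand-ins for `three60_df['org']` and `t60_container`.

```python
import pandas as pd


def combine_downloads(orgs, container):
    """Stack the successfully downloaded tables, labelling rows by funder."""
    frames = []
    for org, item in zip(orgs, container):
        # Failed downloads are stored as status codes or 'error': skip them
        if isinstance(item, pd.DataFrame):
            frames.append(item.assign(funder=org))

    if not frames:
        return pd.DataFrame()

    # Columns missing from a given funder's table are filled with NaN
    return pd.concat(frames, ignore_index=True, sort=False)


# Made-up stand-ins for the real inputs
orgs = ['Funder A', 'Funder B', 'Funder C']
container = [
    pd.DataFrame({'Title': ['Grant 1', 'Grant 2'], 'Amount Awarded': [100, 200]}),
    403,
    pd.DataFrame({'Title': ['Grant 3'], 'Description': ['Something']}),
]

combined = combine_downloads(orgs, container)
print(combined)
```

Because funders use different columns, the combined table will be sparse, which is one more reason to check which fields are shared before analysing it.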