# Health innovation funding landscape exploration

This notebook explores data from 360Giving about the health funding landscape. It seeks to provide some empirical context for the scoping phase of the RWJF foundation. Some preliminary questions we would like to address:

* What are the levels of funding for health innovation (as defined in the project) in the UK?
* How has funding evolved over time?
* What's its geography?
* Who is supporting health innovation?
* What are its topics?
* What are example projects?
* What is Nesta doing?

We will at some point seek to compare the analysis of this funding landscape with the situation in the USA based on an analysis of the activities of RWJF.


# Preamble

In [1]:
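%matplotlib inline

#Imports used throughout the notebook
import os
import io
import datetime
import pickle

import numpy as np
import pandas as pd
import requests

#NB I open a standard set of directories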

#Paths

#Get the top path
top_path = os.path.dirname(os.getcwd())

#Create the path for external data
ext_data = os.path.join(top_path,'data/external')

#Raw path (for html downloads)

raw_data = os.path.join(top_path,'data/raw')

#And processed data
proc_data = os.path.join(top_path,'data/processed')

fig_path = os.path.join(top_path,'reports/figures')

#Get date for saving files
today = datetime.datetime.today()

today_str = "_".join([str(x) for x in [today.day,today.month,today.year]])

In [67]:
#Additional imports

import ratelim


# 1. Data collection

[360Giving](http://www.threesixtygiving.org/data/data-registry/) is a standard for open data about charitable giving in the UK. The 70 organisations participating in the programme make their data available in a standardised way. It is hoped that this open dataset will improve our understanding of the funding landscape in the UK, as well as its impacts and gaps.

A json file with metadata about each of the datasets, including a download link, is available from [this page](http://threesixtygiving.github.io/getdata/). We will loop through its entries, download and concatenate the files, and start an exploratory analysis.

A preliminary glance at the data suggests some lack of standardisation (for example, some funders include the country receiving the grant in the title while others have a specific field dedicated to this). Some fields that appear to be present in all cases are, unsurprisingly:

* project name, 
* description, 
* recipient, 
* period and 
* timeline.  

We can use them to start addressing some of the questions above.



In [13]:
#Let's get started

#First we will acquire the json file with the metadata using the following link

url = 'http://data.threesixtygiving.org/data.json'

#We use the get method to download the data and to parse into a json object
three60_metadata = requests.get(url).json()

Extract relevant information about each element in the json:

* Title
* Organisation name
* Organisation link
* Download url
* Last modified date
* License

In [53]:
#We create a flat dictionary from the json file above and then create a df with it
#This is quite a generic problem. Maybe I should write a tool to do this. 

flat_dict = [{'title':x['title'],
              'org':x['publisher']['name'],
              'org_url':x['publisher']['website'],
              'file_url':x['distribution'][0]['downloadURL'],
              'license':x['license'],
              'modified':x['modified']} for x in three60_metadata]

three60_df = pd.DataFrame.from_dict(flat_dict,orient='columns')

#This is what it looks like
three60_df.head()

| | file_url | license | modified | org | org_url | title |
|---|---|---|---|---|---|---|
| 0 | https://www.arcadiafund.org.uk/wp-content/uplo... | https://creativecommons.org/licenses/by/4.0/ | 2017-09-01T07:29:04+0000 | Arcadia Fund | https://www.arcadiafund.org.uk/ | Arcadia Fund grants awarded to July 2017 |
| 1 | https://www.barrowcadbury.org.uk/wp-content/up... | https://creativecommons.org/licenses/by/4.0/ | 2018-02-05T14:27:52+0000 | Barrow Cadbury Trust | http://www.barrowcadbury.org.uk/ | Grants awarded 2012 to December 2017 |
| 2 | http://downloads.bbc.co.uk/tv/pudsey/360_givin... | https://creativecommons.org/licenses/by/4.0/ | 2017-01-29T17:37:46+0000 | BBC Children in Need | http://www.bbc.co.uk/corporate2/childreninneed | BBC Children in Need grants |
| 3 | https://www.biglotteryfund.org.uk/-/media/File... | http://www.nationalarchives.gov.uk/doc/open-go... | 2017-07-27T10:33:57+0000 | Big Lottery Fund | https://www.biglotteryfund.org.uk/ | Big Lottery Fund - grants data 2015 to 2017 |
| 4 | https://www.biglotteryfund.org.uk/-/media/File... | http://www.nationalarchives.gov.uk/doc/open-go... | 2018-02-19T11:53:07+0000 | Big Lottery Fund | https://www.biglotteryfund.org.uk/ | Big Lottery Fund - grants data 2017-18 year-to... |
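On the remark in the cell above that flattening nested json is a generic problem: recent versions of pandas provide `pd.json_normalize`, which turns nested dictionaries into dot-separated columns. The sketch below applies it to a made-up record shaped like one entry of the metadata file. The sample values are illustrative, not real entries, and this is an alternative to the list comprehension above rather than what the notebook actually ran.

```python
import pandas as pd

# Illustrative record shaped like one entry of the metadata json (values are made up)
sample_metadata = [
    {
        "title": "Example grants file",
        "publisher": {"name": "Example Trust", "website": "http://example.org/"},
        "distribution": [{"downloadURL": "http://example.org/grants.xlsx"}],
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "modified": "2018-01-01T00:00:00+0000",
    }
]

# Nested dictionaries become columns such as 'publisher.name'
flat = pd.json_normalize(sample_metadata)

# Lists are left as they are, so the first download link is extracted by hand
flat["file_url"] = [dist[0]["downloadURL"] for dist in flat["distribution"]]

print(flat[["title", "publisher.name", "publisher.website", "file_url"]])
```

The trade-off is that the column names follow the json structure (`publisher.name`) instead of the shorter names chosen above, so a rename step would still be needed.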


There are several organisations with more than one file. This seems to reflect different funding periods.

In [57]:
three60_df.org.value_counts()[:10]

Oxfordshire Community Foundation                5
The Wolfson Foundation                          3
Scottish Council for Voluntary Organisations    3
Tudor Trust                                     3
LandAid Charitable Trust                        3
Big Lottery Fund                                3
Lankelly Chase Foundation                       3
Pears Foundation                                2
Trafford Housing Trust                          2
Joseph Rowntree Foundation                      2
Name: org, dtype: int64

In [65]:
#Who is in the data?
print("\n".join([x for x in sorted(set(three60_df.org))]))


Arcadia Fund
BBC Children in Need
Barrow Cadbury Trust
Big Lottery Fund
Birmingham City Council
Blagrave Trust
Cabinet Office
Calouste Gulbenkian Foundation (UK Branch)
Cheshire Community Foundation
City Bridge Trust
Co-operative Group
Comic Relief
Community Foundation Tyne & Wear and Northumberland
Community Foundation for Surrey
Dunhill Medical Trust
Equity Foundation
Esmee Fairbairn Foundation
Essex Community Foundation
Gatsby Charitable Foundation
Greenham Common Trust
Henry Smith Charity
Indigo Trust
Joseph Rowntree Charitable Trust
Joseph Rowntree Foundation
LandAid Charitable Trust
Lankelly Chase Foundation
Lloyd's Register Foundation
Lloyds Bank Foundation
London Borough of Barnet
London Catalyst
London Councils
Macc
Millfield House Foundation
Nationwide Foundation
Nesta
Northern Rock Foundation
One Manchester
Oxford City Council
Oxfordshire Community Foundation
Paul Hamlyn Foundation
Pears Foundation
Power to Change
Quartet Community Foundation
Quixote Foundation
R S Macdonald

In [66]:
#How many organisations?
#71 organisations (some of them with more than one file)
len(set(three60_df.org))

71

In [234]:
def get_file_type_string(request):
    '''
    This function takes the response returned by requests.get and returns a string where we look for the file type
    
    '''
    
    #Extract the url. This will often contain the file extension
    text = request.url
    
    #Also add metadata from the response headers, in case there was no file extension:
    if 'Content-Disposition' in request.headers:
        text = text + ' '+request.headers['Content-Disposition']
        
    return(text)
        


@ratelim.patient(5,10)
def get_360_file(url):
    '''
    This function downloads each file in the 360 degree data. We mostly create it to decorate with the rate limiter,
    which slows down the pace at which we download files from 360.
    
    Returns a dataframe if the download works and the HTTP status code otherwise.
    
    '''
    
    #Different data sources have different formats so we have to work with that as well.
    
    #Get the file
    request = requests.get(url)
    
    #If the status code is not 200, return it so we can check it later
    if request.status_code!=200:
        return(request.status_code)
    
    #The parsing depends on the type of file. We get the type of file from the header or the url name
    file_type_string = get_file_type_string(request)
    
    #This takes ages with large files.
    if '.csv' in file_type_string:
        #We need to stream the text into the csv
        table = pd.read_csv(io.StringIO(request.text))
    
    elif '.xls' in file_type_string:
        #Excel is a bit different: we read the first sheet from the raw bytes
        with io.BytesIO(request.content) as fh:
            table = pd.read_excel(fh, sheet_name=0)
    
    elif '.json' in file_type_string:
        #There is even one download with json!
        table = pd.DataFrame.from_dict(request.json()['grants'])
    
    else:
        #Unknown file type: raise an explicit error instead of failing on an undefined variable
        raise ValueError('Unrecognised file type: '+file_type_string)
    
    return(table)
    
    

In [289]:
#This loops over the urls we have and puts them in a container. When this doesn't work,
#it returns an error we can check later.

t60_container = []

for url in three60_df['file_url']:
    print(url)
    try:
        file = get_360_file(url)
        t60_container.append(file)
    except Exception:
        #Any failure while downloading or parsing is recorded so we can check it later
        t60_container.append('error')

https://www.arcadiafund.org.uk/wp-content/uploads/2017/07/Arcadia-grants-360Giving-28-Jul-2017.xlsx




https://www.barrowcadbury.org.uk/wp-content/uploads/2018/02/Copy-of-2017-12-360-Giving-until-2017-12-revised.xlsx
http://downloads.bbc.co.uk/tv/pudsey/360_giving_data_02102016.xlsx
https://www.biglotteryfund.org.uk/-/media/Files/Research%20Documents/aOpenDataFiles/open_data_2015_2017.xlsx
https://www.biglotteryfund.org.uk/-/media/Files/Research%20Documents/aOpenDataFiles/BLFOpenData17-18.xlsx
https://www.biglotteryfund.org.uk/-/media/Files/Research%20Documents/aOpenDataFiles/BLFOpenData_2004_2015_V41.csv
https://data.birmingham.gov.uk/dataset/bb896f0b-10d7-403d-bad4-cc147349c380/resource/6ff023e2-947a-4eb9-bd67-0cdd2c7163dc/download/ssystemsgovernancetransparencygrants360-giving-bcc-data_2014-17-v2.xlsx
https://www.blagravetrust.org/wp-content/uploads/2018/02/360G-blagravetrust-2017.xlsx
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/663589/GGIS_Grant_Awards_2016_to_2017_2017-10-27_1621.xlsx
https://s3-eu-central-1.amazonaws.com/content.gulbenkian.pt/wp-conte

In [297]:
#How many errors

errors = [(y,x) for x,y in zip(three60_df['file_url'],t60_container) if not isinstance(y,pd.DataFrame)]

errors

[('error',
  'http://www.equityfoundation.co.uk/wp-content/uploads/2016/11/360-Upload-Jan-16-July-2017-1.xlsx'),
 (403,
  'https://3p50ut4bws5s2uzhmycc4t21-wpengine.netdna-ssl.com/wp-content/uploads/2018/03/OCF-Grants_FY17-18_21mar.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2017/04/OCF-Grants_FY13-14-mod.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2017/10/OCF-Grants_FY15-16-mod.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2016/12/OCF-Grants_FY14-15-mod.xlsx'),
 (403,
  'http://oxfordshire.org/wp-content/uploads/2017/10/OCF-Grants_FY16-17-final.xlsx'),
 (403,
  'http://www.powertochange.org.uk/wp-content/uploads/2017/07/Power-to-Change-Grants-Data-2015-2016.xlsx'),
 (403,
  'http://tudortrust.org.uk/assets/file/Fixed_2017-04-28_Tudor_Trust_grants_01-04-16_-_31-03-17.xlsx'),
 (403,
  'http://tudortrust.org.uk/assets/file/2018-01-04_grants_01-04-13_to_31-03-16_FNL_revised.xlsx'),
 (403,
  'http://tudortrust.org.uk/assets/file/2018-01-02_data

Almost all of the failures are 403 (Forbidden) responses. The one exception appears to be a problem with the formatting of the file. The 403 files can be downloaded manually, or using urllib, and then loaded, but I'm going to leave it for now.

TODO: Fix errors above
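A minimal sketch of the urllib route mentioned above. It assumes, without having verified it for these servers, that the 403 responses are triggered by the default client user agent, so it sends a browser-like `User-Agent` header instead. The header value and the helper names `build_request` and `download_bytes` are illustrative.

```python
import urllib.request

# Assumed header value: any browser-like string could be tried here
BROWSER_USER_AGENT = "Mozilla/5.0 (compatible; funding-landscape-notebook)"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_bytes(url, timeout=30):
    """Download a file and return its raw bytes (raises an HTTPError on 403 and similar)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

If this works for a given url, the returned bytes could be wrapped in `io.BytesIO` and read with `pd.read_excel`, as `get_360_file` does for the other Excel files. If the server blocks requests for another reason, downloading manually remains the fallback.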

In [589]:
#Pickle the file 

with open(ext_data+'/{date}_file_download.p'.format(date=today_str),'wb') as outfile:
    pickle.dump(t60_container,outfile)

### Initial exploration

In [339]:
#This list contains the dfs we have managed to download
t60_dfs = [x for x in t60_container if isinstance(x,pd.DataFrame)]

In [361]:
#What columns are shared across datasets?
t60_columns = [set([c.lower() for c in x.columns]) for x in t60_dfs]

#This gives the intersection of all the sets
u = set.intersection(*t60_columns)
u

{'currency', 'description', 'title'}

Only 3 fields are shared by every file. We will look instead for the columns which appear most often across the files.

In [335]:
#Now let's see what fields appear most frequently

def flatten_list(my_list):
    '''
    Turns a nested list into a flat list
    
    '''
    
    flat = [x for el in my_list for x in el]
    
    return(flat)

In [342]:
#This list comprehension gives us a list of column names for each df (we lower them)
t60_column_names = [[name.lower() for name in x.columns] for x in t60_dfs]

#And here are their frequencies across the downloaded files
column_freq = pd.Series(flatten_list(t60_column_names)).value_counts()

column_freq[:20]

currency                           86
title                              86
description                        86
award date                         82
amount awarded                     82
identifier                         82
funding org:identifier             76
recipient org:name                 75
funding org:name                   75
recipient org:identifier           75
last modified                      69
recipient org:charity number       67
planned dates:duration (months)    53
recipient org:company number       50
grant programme:title              48
planned dates:start date           47
planned dates:end date             46
beneficiary location:name          44
recipient org:postal code          41
recipient org:city                 40
dtype: int64

In [353]:
#Let's see what's the coverage with different shared column names.

def extract_shared_variables(variable_names,df_list):
    '''
    Takes a list of fields and returns a concatenated df with them.
    
    '''
    
    df_container = []
    
    for x in df_list:
        x.columns = [f.lower() for f in x.columns]
        
        x_subset = x[[var for var in x.columns if var in variable_names]]
        df_container.append(x_subset)
        
    df_concat = pd.concat(df_container,axis=0)
    
    return(df_concat)
    
        



In [363]:
#And here is the shared df!
t60_df = extract_shared_variables(column_freq[:20].index,t60_dfs)

In [381]:
#How many missing values per variable?

#This list comprehension loops over the columns and keeps those where less than 10% of the values are missing
fields_above_thres = [col for col in t60_df.columns if 100*t60_df[col].isna().mean()<10]

fields_above_thres

['amount awarded',
 'award date',
 'currency',
 'description',
 'funding org:identifier',
 'funding org:name',
 'identifier',
 'recipient org:identifier',
 'recipient org:name',
 'title']

These are the variables with fewer than 10% missing values, i.e. present in at least 90% of projects in the data. They contain a lot of the info
we want to answer the questions above, with the exception of place :-(
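The same missing-value check can be written more compactly with `isna().mean()`, which gives the share of missing values per column in one go. The sketch below uses a toy dataframe and a 30% threshold chosen only to make the example readable. The column values are made up.

```python
import pandas as pd

# Toy data: 'title' is complete, 'award date' misses 1 in 4, 'recipient org:city' misses 3 in 4
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "award date": ["2017-01-01", None, "2017-03-01", "2017-04-01"],
    "recipient org:city": [None, None, None, "Leeds"],
})

# Percentage of missing values per column
missing_share = toy.isna().mean() * 100

# Keep the columns whose missing share is below the (illustrative) threshold
well_populated = missing_share[missing_share < 30].index.tolist()

print(missing_share)
print(well_populated)
```

Keeping `missing_share` around also makes it easy to see how far a field such as location falls below the cut-off before deciding to drop it.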

In [425]:
ts_df = t60_df[fields_above_thres]

# 2. Data analysis

The main goal is to identify and explore health innovation related projects. This definition has a domain aspect (the projects need to seek improvements in health outcomes) and a novelty aspect (they need to be new or different from what's done in the field). We will explore several strategies to get a handle on this. This includes:

1. Identify health projects

We will use data about project categories (which are available for some if not all projects) to train a model predicting if it is in health, and also to analyse the overlaps between projects in health and other domains.

2. Map activity inside health

Once we have a corpus of 'health' projects, we will classify them into finer categories using a third party taxonomy (eg disease areas and project types).

3. Find innovative projects

This is the least straightforward part. We are looking for novelty. This can be defined in different ways:

* Projects that mention innovation

* Projects involving innovative technologies (in this case we would look for keywords based on some domain-based list of technologies or keywords).

* Projects similar to those sponsored by innovative organisations, e.g. train a model on the Nesta data and look for similar projects

* Projects that bridge domains in unusual ways

* Projects that are unique in that they don't fall in existing clusters or form their own clusters (need to decide how to do the clustering).

* Projects with trending keywords (keywords that started appearing recently in the data)


We will explore some of these options in the 360Giving data.
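The simplest of the options above, flagging projects that mention innovation, can be sketched with a regular expression over the descriptions. The keyword pattern and the toy descriptions below are assumptions made for illustration. A real list of terms would need to be agreed in the project.

```python
import pandas as pd

# Hypothetical keyword pattern: word stems that hint at novelty
INNOVATION_PATTERN = r"\b(?:innovat\w*|pilot\w*|novel|prototype\w*)\b"

# Toy descriptions (made up), including a missing value
toy = pd.DataFrame({
    "description": [
        "Towards running costs of a youth club",
        "Piloting an innovative digital mental health service",
        None,
        "A novel approach to social prescribing",
    ]
})

# Case-insensitive match; missing descriptions are treated as not mentioning innovation
toy["mentions_innovation"] = toy["description"].str.contains(
    INNOVATION_PATTERN, case=False, regex=True, na=False
)

print(toy)
```

A flag like this is cheap to compute but will miss innovative projects that do not describe themselves as such, which is why the other options in the list remain relevant.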

### Identify health projects




The 'impact category' and domain variables are not generally present in the data. Initially I thought about using those projects where that information is available to train a predictive model we could then apply to the unlabelled data. Unfortunately very few organisations provide this information, and in general their descriptions are too short.

We will explore alternative strategies, for example using the words with the highest tf-idf in those 'health' categories, boosted with word embeddings.
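To make the tf-idf idea concrete, here is a small sketch that ranks terms by their summed tf-idf over the documents labelled as health. It uses only the standard library and toy token lists. The function name `top_tfidf_terms` and the data are illustrative, and the weighting (term frequency times the log of inverse document frequency) is one common variant among several.

```python
import math
from collections import Counter


def top_tfidf_terms(token_lists, labels, top_n=5):
    """Rank terms by summed tf-idf across the documents whose label is True."""
    n_docs = len(token_lists)

    # Document frequency: number of documents in which each term appears
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))

    scores = Counter()
    for tokens, is_health in zip(token_lists, labels):
        if not is_health or not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf

    return [term for term, _ in scores.most_common(top_n)]


# Toy tokenised descriptions (made up) and whether each one is labelled as health
docs = [
    ["mental", "health", "support"],
    ["running", "costs", "support"],
    ["health", "clinic", "costs"],
    ["youth", "club", "running"],
]
labels = [True, False, True, False]

print(top_tfidf_terms(docs, labels, top_n=3))
```

Terms that are frequent everywhere, such as 'support' or 'costs', are pushed down by the inverse document frequency, which is the property we want given how generic the most common words in the descriptions are.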


In [569]:
from nltk.corpus import stopwords

#The corpus file is called 'english' (lowercase), which matters on case-sensitive file systems
stop = stopwords.words('english')

In [570]:
impact_vars = ['impact category', 'primary issue']

t60h = extract_shared_variables(fields_above_thres+impact_vars,t60_dfs)


In [580]:
#Impact variables to focus on
impact_vars = ['impact category', 'primary issue','classifications:title']

#t60h means three sixty health
t60h = extract_shared_variables(fields_above_thres+impact_vars,t60_dfs).reset_index(drop=True)

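#First some tidying up of variable names.
#NB the renaming below is positional: it assumes the concatenated columns come out in alphabetical order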

t60h.columns

t60h.columns = ['value','award_date','impact_1','currency','description','funder_id','funder_name','identifier',
               'impact_2','impact_3','recipient_id','recipient_name','title']

#Tokenise descriptions and remove stopwords

#For each description: lowercase the words and drop those in the stopword list;
#np.nan if the description is not a string (e.g. a missing value)

stop_set = set(stop)

t60h['description_tokens'] = [
    [w.lower() for w in x.split(" ") if w.lower() not in stop_set] if isinstance(x,str) else np.nan
    for x in t60h.description]



In [588]:
#How will this work? 

#Now we have a set of impact categories
impact_cats = set(t60h['impact_1'].dropna()) | set(t60h['impact_2'].dropna()) | set(t60h['impact_3'].dropna())

health_cats = {x for x in impact_cats if 'health' in x.lower()}

#Now we identify the projects in these categories

#We combine the categories in all the impact variables
t60h['impact_categories'] = [' '.join([str(x),
                                       str(y),
                                       str(z)]) for x,y,z in zip(t60h.impact_1,t60h.impact_2,t60h.impact_3)]

#Focus on the projects where we have at least one category of impact (we go down to 5k)
#We take a copy so that adding columns below does not trigger a SettingWithCopyWarning
t60_labelled = t60h.loc[[x!='nan nan nan' for x in t60h.impact_categories]].copy()

#Find the healthy ones
t60_labelled['has_health'] = [any(x in val for x in health_cats) for val in t60_labelled.impact_categories]


#Top words in each group
t60_labelled.groupby('has_health')['description_tokens'].apply(lambda x: pd.Series(
    [val for el in x for val in el]).value_counts()[:10])

#I don't think this is going to be very enlightening




has_health               
False       towards          2996
            costs            1490
            years'           1472
            project          1447
            people           1251
            three            1187
            running          1158
            support          1146
            disadvantaged     845
            children          810
True        towards           426
            people            352
            support           302
            costs             260
            project           209
            health            194
            mental            194
            running           171
            years'            169
            requested         136
Name: description_tokens, dtype: int64

In [None]:
# We will have to simplify things