
<img src="http://www.nserc-crsng.gc.ca/_gui/wmms.gif" alt="Canada logo" align="right">

<br>

<img src="http://www.triumf.ca/sites/default/files/styles/gallery_large/public/images/nserc_crsng.gif?itok=H7AhTN_F" alt="NSERC logo" align="right" width = 90>



# Exploring NSERC Awards Data


Canada's [Open Government Portal](http://open.canada.ca/en) includes [NSERC Awards Data](http://open.canada.ca/data/en/dataset/c1b0f627-8c29-427c-ab73-33968ad9176e) from 1991 through 2015. The [2015](http://www.nserc-crsng.gc.ca/NSERC-CRSNG/FundingDecisions-DecisionsFinancement/ResearchGrants-SubventionsDeRecherche/ResultsGSC-ResultatsCSS_eng.asp?Year=2015) and [2016](http://www.nserc-crsng.gc.ca/NSERC-CRSNG/FundingDecisions-DecisionsFinancement/ResearchGrants-SubventionsDeRecherche/ResultsGSC-ResultatsCSS_eng.asp?Year=2016) data are also available separately as web archives. 

The awards data (in .csv format) were copied to an [Amazon Web Services S3 bucket](http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html). This open Jupyter notebook starts an exploration of the NSERC investment portfolio from 1995 to 2015. The notebook assumes that you have your AWS keys set up in `~/.aws/credentials`. See the [boto3 docs](http://boto3.readthedocs.io/en/latest/guide/configuration.html) for more information on configuring credentials. (If you'd like access to the data hosted on S3, please contact [James Colliander](http://colliand.com).)
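
For reference, a shared credentials file usually has the shape shown in the sketch below. It parses a sample string with Python's standard `configparser` and never touches a real file; the key values are placeholders, not working credentials.

```python
import configparser

# Placeholder contents of ~/.aws/credentials; the values are NOT real keys.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CREDENTIALS)

# boto3 reads the "default" profile unless another profile is selected.
profile = parser["default"]
has_both_keys = ("aws_access_key_id" in profile
                 and "aws_secret_access_key" in profile)
print("default profile complete:", has_both_keys)
```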

> **Acknowledgement:** I thank [Ian Allison](https://github.com/ianabc) of the [Pacific Institute for the Mathematical Sciences](http://www.pims.math.ca/) for building the [JupyterHub service](https://pims.jupyter.ca) and for help with this notebook. -- J. Colliander

In [None]:
## Import some Python resources for data and interactive plots.
import pandas as pd
import matplotlib.pyplot as plt
from ipywidgets import widgets
import numpy as np
import seaborn as sns

import sys

from IPython.display import display, clear_output

sns.set_style("darkgrid")

%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 6

In [None]:
## Import the tools for accessing data hosted on AWS S3.
import boto3
import botocore

## name the bucket containing the data
nsercBucket='pims-open-data'

s3 = boto3.client('s3')
exists = True

try:
    s3.head_bucket(Bucket=nsercBucket)
except botocore.exceptions.ClientError as e:
    # If a client error is thrown, then check that it was a 404 error.
    # If it was a 404 error, then the bucket does not exist.
    error_code = int(e.response['Error']['Code'])
    if error_code == 404:
        exists = False
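
The check above converts the error code to an integer, which assumes the code is always numeric. As a minimal sketch, the hypothetical helper `bucket_status` (not part of boto3) classifies the error dictionary while tolerating string codes such as `NoSuchBucket`. The dictionaries passed in are hand-written stand-ins for `e.response`.

```python
def bucket_status(error_response):
    """Classify the error dict carried by a botocore ClientError.

    `bucket_status` is a hypothetical helper written for this sketch.
    """
    code = str(error_response.get("Error", {}).get("Code", ""))
    if code in ("404", "NoSuchBucket"):
        return "missing"      # the bucket does not exist
    if code in ("403", "AccessDenied"):
        return "forbidden"    # the bucket exists but access is denied
    return "unknown"

# Hand-written dictionaries standing in for e.response:
print(bucket_status({"Error": {"Code": "404"}}))
print(bucket_status({"Error": {"Code": "AccessDenied"}}))
```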

In [None]:
## Bring in a selection of the NSERC awards data starting with 1995 and ending with 2015.
## Throw away as much as you can to keep the DataFrame small enough to manipulate using a laptop.

startYear=1995
endYear=2016  ## This means we include the 2015 collection but not 2016.

nserc = []
institutionAwards = []

s3 = boto3.client('s3')  ## Create the client once and reuse it for every download.

for year in range(startYear, endYear):
    obj = s3.get_object(
        Bucket=nsercBucket, Key='NSERC_GRT_FYR'+str(year)+'_AWARD.csv')
    df = pd.read_csv(obj['Body'], 
                     encoding='latin1', 
                     usecols = [1, 2, 3, 4, 5, 7, 9, 11, 12, 13, 28],
                    )
    df.columns = ['Name', 'Department', 'OrganizationID',
                 'Institution', 'ProvinceEN', 'CountryEN',
                 'FiscalYear', 'AwardAmount', 'ProgramID',
                 'ProgramNameEN','ResearchSubjectEN']   ## Rename various columns for easier access.
    nserc.append(df)
    print(year)
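
The loading pattern above (read a subset of columns, rename them, collect one frame per year) can be tried without S3 access. The sketch below uses two tiny made-up CSV strings in place of the S3 objects. The file contents, column names, and the `fake_files` dictionary are all assumptions for illustration, and the year is attached from the dictionary key rather than read from the file.

```python
import io
import pandas as pd

# Two made-up CSV "files" standing in for the per-year S3 objects.
fake_files = {
    1995: "id,name,dept,amount\n1,Researcher A,Mathematics,1000\n",
    1996: "id,name,dept,amount\n2,Researcher B,Statistics,2500\n",
}

frames = []
for year, text in sorted(fake_files.items()):
    # Keep only columns 1 and 3 (name and amount) to save memory.
    df = pd.read_csv(io.StringIO(text), usecols=[1, 3])
    df.columns = ["Name", "AwardAmount"]
    df["FiscalYear"] = year
    frames.append(df)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```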

In [None]:
## Again, throw away some superfluous data to minimize impact on memory.
try:
    nsercDF = pd.concat(nserc)
    del(nserc)
except NameError:
    print("Namespace already cleaned")

print("DataFrame: {:4.2f} MB".format(sys.getsizeof(nsercDF) / (1024. * 1024)))  ## Quantify data stored in memory.
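
One further way to reduce the memory footprint, offered here as a sketch and not as something this notebook does, is to store heavily repeated text columns as pandas categoricals. The frame below is made up, and `memory_usage(deep=True)` is used to measure the effect.

```python
import pandas as pd

# Made-up frame with a heavily repeated text column.
df = pd.DataFrame({
    "Institution": ["University A", "University B"] * 5000,
    "AwardAmount": range(10000),
})

before = df.memory_usage(deep=True).sum()
# Each distinct string is stored once; rows hold small integer codes.
df["Institution"] = df["Institution"].astype("category")
after = df.memory_usage(deep=True).sum()

print("before: {:.2f} MB".format(before / 1024**2))
print("after:  {:.2f} MB".format(after / 1024**2))
```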

In [None]:
## These are the columns in our data table.
nsercDF.columns

In [None]:
## This is what the data looks like.
nsercDF

## Total Invested by NSERC Over Time

We accumulate the award amounts into a total sum for each year and plot these values over time. These calculations do not take inflation or other factors into account.

In [None]:
## Select the column before summing so that only AwardAmount is aggregated.
awardTotals = nsercDF.groupby('FiscalYear')['AwardAmount'].sum()

import matplotlib.ticker as mtick

fig = plt.figure()

ax = fig.add_subplot(111)

ax.plot(awardTotals.index, awardTotals/10**6)
ax.set_ylabel('Award Total (millions of $)')
ax.set_xlabel('Year')

ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%4d'))
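
The totals above are nominal. As a minimal sketch of how an inflation adjustment could be applied, the example below rescales nominal totals to base-year dollars using a price index. Every number here is made up for illustration, and `price_index` is a hypothetical series that would have to be replaced with real index data.

```python
import pandas as pd

# Nominal totals and a price index; ALL values are made up for illustration.
nominal = pd.Series({2013: 100.0, 2014: 110.0, 2015: 121.0})
price_index = pd.Series({2013: 100.0, 2014: 102.0, 2015: 104.0})

base_year = 2015
# Express each year's total in base-year dollars.
real = nominal * price_index.loc[base_year] / price_index
print(real.round(2))
```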

## 2014 Investments by `Institution`

Let's focus on 2014. We accumulate all the awards for each institution and sort by the resulting totals. Then we list the ten institutions that received the biggest investments from NSERC.

In [None]:
byInstitution = nsercDF[nsercDF.FiscalYear == 2014].groupby('Institution')
top10 = byInstitution['AwardAmount'].sum().sort_values(
    ascending=False).head(n=10)
top10

We set colors for these institutions to set up visualizations of the data.

In [None]:
institutionList = list(top10.index)
instColor = zip(institutionList, sns.color_palette())
institutionList

In [None]:
## Make a pie chart.
awards = nsercDF[nsercDF.FiscalYear == 2014].groupby(
    'Institution')[['AwardAmount']].sum().sort_values(
    'AwardAmount', ascending=False).head(n=10)
## Pass the column by keyword; plot.pie needs y= to know which column to draw.
awards.loc[institutionList].plot.pie(y='AwardAmount',
                                       figsize=(8,8), legend=False)

In [None]:
## Make a bar chart.
a = awards.loc[institutionList]['AwardAmount']
a.plot.bar(color=sns.color_palette()) 

## 2015 Award Totals by `Province`

In [None]:
byProvince = nsercDF[nsercDF.FiscalYear == 2015].groupby('ProvinceEN')
provinceAmounts = byProvince['AwardAmount'].sum().sort_values(
    ascending=False)
provinceAmounts

## 2015 Award Totals by `ProgramName`

In [None]:
byProgramName = nsercDF[nsercDF.FiscalYear == 2015].groupby('ProgramNameEN')
programNameAmounts = byProgramName['AwardAmount'].sum().sort_values(
    ascending=False)
programNameAmounts

## Specific `Department` within an `Institution` over Time

**UBC Mathematics**

In [None]:
ubcMath = nsercDF.loc[(nsercDF['Department'].isin(['Mathematics'])) 
            & (nsercDF['Institution'].isin(['University of British Columbia']))
            ].groupby('FiscalYear')['AwardAmount'].sum()

fig = plt.figure()

ax = fig.add_subplot(111)

ax.plot(ubcMath.index, ubcMath/10**6)
ax.set_ylabel('Award Total (millions of $)')
ax.set_xlabel('Year')

ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%4d'))

## Big Winners over Time

In [None]:
byName = nsercDF.loc[(nsercDF['AwardAmount'] > 1000000)].groupby('Name')
byName['AwardAmount'].sum().sort_values(
    ascending=False).head(n=50)

The first female researcher on the list of "Big NSERC Winners" during the 1996-2014 timeframe appears in position 33.  

## Individual Principal Investigator

In [None]:
nsercDF.loc[nsercDF['Name'].isin(['Vinet, Luc'])]

In [None]:
nsercDF.loc[nsercDF['Name'].isin(['Colliander, James'])]

In [None]:
nsercDF.loc[nsercDF['Name'].isin(['Hinton, Geoffrey'])]['AwardAmount'].sum()
## Total Amount Invested in CAD (not corrected for inflation)

In [None]:
nsercDF.loc[nsercDF['Name'].isin(['Hinton, Geoffrey'])]['AwardAmount'].plot(kind='bar')

## Exploring a Specific Program

In [None]:
nsercDF.loc[nsercDF['ProgramNameEN'].isin(['Canada Excellence Research Chairs'])]['AwardAmount'].sum()

In [None]:
nsercDF.loc[nsercDF['ProgramNameEN'].isin(['Canada Excellence Research Chairs'])]['AwardAmount'].plot(kind='bar')

## Exploring CTRMS Envelope

In [None]:
nsercDF.loc[nsercDF['ProgramID'].isin(['CTRMS'])]['AwardAmount'].sum()

In [None]:
nsercDF.loc[nsercDF['ProgramID'].isin(['CTRMS'])]

In [None]:
nsercDF.loc[(nsercDF['Department'].isin(['Statistics'])) & (nsercDF['FiscalYear'].isin([2013]) ) 
            & (nsercDF['ProgramID'].isin(['RGPIN']))]['AwardAmount'].sum()

In [None]:
mathstatsDF = nsercDF.loc[(nsercDF['Department'].isin(['Mathematics']) 
             | nsercDF['Department'].isin(['Statistics']) 
             | nsercDF['Department'].isin(['Mathematics (St. George Campus)'])) 
            & (nsercDF['FiscalYear'].isin([2015]) ) 
            & (nsercDF['ProgramID'].isin(['RGPIN']))]
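
The filter above chains three `isin` calls with `|`. Because `isin` accepts a list of values, the same selection can be written with a single call. The sketch below shows this on a small made-up frame; the rows and the `departments` list are illustrative only.

```python
import pandas as pd

# Made-up rows using the same column names as nsercDF.
df = pd.DataFrame({
    "Department": ["Mathematics", "Statistics", "Physics",
                   "Mathematics (St. George Campus)"],
    "FiscalYear": [2015, 2015, 2015, 2014],
    "ProgramID": ["RGPIN", "RGPIN", "RGPIN", "RGPIN"],
    "AwardAmount": [10, 20, 30, 40],
})

departments = ["Mathematics", "Statistics", "Mathematics (St. George Campus)"]
mask = (df["Department"].isin(departments)
        & (df["FiscalYear"] == 2015)
        & (df["ProgramID"] == "RGPIN"))
subset = df.loc[mask]
print(subset["AwardAmount"].sum())
```

Keeping the department names in one list also makes it easier to apply the same definition of "math and stats" to every year.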

In [None]:
mathstatsDF['AwardAmount'].sum()

In [None]:
mathstatsDF['AwardAmount'].plot(kind='hist')

In [None]:
nsercDF.loc[(nsercDF['Department'].isin(['Mathematics']) 
             | nsercDF['Department'].isin(['Statistics']) 
             | nsercDF['Department'].isin(['Mathematics (St. George Campus)'])) 
            & (nsercDF['FiscalYear'].isin([2012]) ) 
            & (nsercDF['ProgramID'].isin(['RGPIN']))]

In [None]:
nsercDF.loc[(nsercDF['Department'].isin(['Mathematics'])
            | nsercDF['Department'].isin(['Statistics'])
            | nsercDF['Department'].isin(['Mathematics (St. George Campus)'])) 
            & (nsercDF['FiscalYear'].isin([2012]) )
            & (nsercDF['ProgramID'].isin(['RGPIN']))
           ].describe()

In [None]:
msDG = nsercDF.loc[(nsercDF['Department'].isin(['Mathematics'])
            | nsercDF['Department'].isin(['Statistics'])
            | nsercDF['Department'].isin(['Mathematics (St. George Campus)'])
            | nsercDF['Department'].isin(['Mathematics (Toronto)'] )     ) 
           ]

In [None]:
timeMath = msDG.groupby('FiscalYear')['AwardAmount'].sum()

fig = plt.figure()

ax = fig.add_subplot(111)

ax.plot(timeMath.index, timeMath/10**6)
ax.set_ylabel('Award Total (millions of $)')
ax.set_xlabel('Year')

ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%4d'))

In [None]:
timeMath

In [None]:
msDG.describe()