# Nature NIH conflict of interest investigation

Working process to determine, for each institution, the percentage of its grants that have an FCOI (financial conflict of interest) investigation associated with them.

`FCOI working doc.xlsx` - NIH FOI document listing all conflict of interest reports that investigators filed in 2012 and 2013

**NIH Research Portfolio Online Reporting Tools (RePORT)** - A repository of NIH-funded research projects, with access to the publications and patents resulting from that funding.

**ExPORTER Data Catalog** [http://exporter.nih.gov/ExPORTER_Catalog.aspx](http://exporter.nih.gov/ExPORTER_Catalog.aspx) - makes downloadable versions of the data accessed through the RePORT Expenditures and Results (RePORTER) interface available to the public. 

## Download and unzip NIH RePORTER files for fiscal years 2011-14

In [2]:
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2014.zip
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2013.zip
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2012.zip
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2011.zip
    
!unzip RePORTER_PRJ_C_FY2014.zip 
!unzip RePORTER_PRJ_C_FY2013.zip
!unzip RePORTER_PRJ_C_FY2012.zip
!unzip RePORTER_PRJ_C_FY2011.zip

!rm RePORTER_PRJ_C_FY2014.zip
!rm RePORTER_PRJ_C_FY2013.zip
!rm RePORTER_PRJ_C_FY2012.zip
!rm RePORTER_PRJ_C_FY2011.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40.6M  100 40.6M    0     0   388k      0  0:01:47  0:01:47 --:--:--  394k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40.2M  100 40.2M    0     0   389k      0  0:01:45  0:01:45 --:--:--  394k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40.2M  100 40.2M    0     0   384k      0  0:01:47  0:01:47 --:--:--  386k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 37.4M  100 37.4M    0     0   386k      0  0:01:39  0:01:39 --:--:--  393k
Archive:  RePORTER_PRJ_C_FY2014.zip
  inflating: ReP

We should now have four CSV files, one for each of the fiscal years 2011, 2012, 2013 and 2014.
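
A quick way to confirm that the four files are present is sketched below. It assumes each archive unpacks to a CSV with the same base name as the zip (the names used by the `in2csv` commands later in this document). The `directory` argument and both helper names are hypothetical, introduced only for this illustration; the later commands read from `ref/csv/`, so point `directory` at wherever the files actually ended up.

```python
import os

# Fiscal years covered by the downloads above.
YEARS = [2011, 2012, 2013, 2014]


def expected_csvs(directory="."):
    """Return the CSV paths we expect after unzipping, one per fiscal year.

    Assumption: each archive unpacks to RePORTER_PRJ_C_FY<year>.csv.
    """
    return [os.path.join(directory, "RePORTER_PRJ_C_FY%d.csv" % year)
            for year in YEARS]


def missing_csvs(directory="."):
    """Return the expected CSV paths that do not exist in `directory`."""
    return [path for path in expected_csvs(directory)
            if not os.path.isfile(path)]
```

Calling `missing_csvs("ref/csv")` returns an empty list when all four files are in place.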

## Clean and combine the NIH RePORTER files

Extract just the relevant columns from the four NIH files into new files, converting them from `iso-8859-1` to `utf-8` along the way.

In [4]:
!in2csv -e iso-8859-1 ref/csv/RePORTER_PRJ_C_FY2014.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > nih_2014.csv
!in2csv -e iso-8859-1 ref/csv/RePORTER_PRJ_C_FY2013.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > nih_2013.csv
!in2csv -e iso-8859-1 ref/csv/RePORTER_PRJ_C_FY2012.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > nih_2012.csv
!in2csv -e iso-8859-1 ref/csv/RePORTER_PRJ_C_FY2011.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > nih_2011.csv
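
For readers without csvkit, the same step can be approximated with Python's standard library. This is a minimal sketch written for Python 3, not the method used in this notebook: `extract_columns` is a hypothetical helper that reads the source as `iso-8859-1`, keeps only the listed columns, and writes `utf-8`.

```python
import csv

# The nine columns kept by the csvcut commands above.
COLUMNS = ["APPLICATION_ID", "PROJECT_TITLE", "PI_NAMEs", "ORG_NAME",
           "FUNDING_MECHANISM", "FY", "PROJECT_START", "PROJECT_END",
           "TOTAL_COST"]


def extract_columns(src_path, dst_path, columns=COLUMNS):
    """Copy `columns` from a latin-1 CSV into a new UTF-8 CSV.

    Returns the number of data rows written (header excluded).
    """
    written = 0
    with open(src_path, "r", encoding="iso-8859-1", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" silently drops columns we did not ask for.
        writer = csv.DictWriter(dst, fieldnames=columns,
                                extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
            written += 1
    return written
```

For example, `extract_columns("ref/csv/RePORTER_PRJ_C_FY2014.csv", "nih_2014.csv")` would stand in for the first command above.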

Combine all of the entries for all four years into one CSV file:

In [6]:
!csvstack nih_2014.csv nih_2013.csv nih_2012.csv nih_2011.csv > nih_all.csv
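
The stacking step can also be sketched in plain Python 3. This is a simplified stand-in for `csvstack`, not a reimplementation of it: the hypothetical `stack_csvs` helper writes the header once, appends the data rows of every input file, and (a design choice made for this sketch) raises an error if the headers differ.

```python
import csv


def stack_csvs(paths, dst_path):
    """Concatenate CSV files that share one header into a single file.

    Returns the number of data rows written (header excluded).
    """
    header = None
    written = 0
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for path in paths:
            with open(path, "r", encoding="utf-8", newline="") as src:
                reader = csv.reader(src)
                this_header = next(reader)
                if header is None:
                    # First file: remember its header and write it once.
                    header = this_header
                    writer.writerow(header)
                elif this_header != header:
                    raise ValueError("header mismatch in %s" % path)
                for row in reader:
                    writer.writerow(row)
                    written += 1
    return written
```

Refusing mismatched headers keeps a column mix-up from silently propagating into the combined file.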

In [10]:
!wc -l nih_all.csv #Count the number of lines in the file

  315065 nih_all.csv


In [9]:
!csvcut -n nih_all.csv #Get a numbered list of the columns

  1: APPLICATION_ID
  2: PROJECT_TITLE
  3: PI_NAMEs
  4: ORG_NAME
  5: FUNDING_MECHANISM
  6: FY
  7: PROJECT_START
  8: PROJECT_END
  9: TOTAL_COST


The regulation does not apply to Phase I Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) applications/awards, so remove every row whose `FUNDING_MECHANISM` column contains `SBIR-STTR`.

In [15]:
!csvgrep -c 5 -r "SBIR-STTR" -i nih_all.csv > temp.csv && mv temp.csv nih_all.csv

In [16]:
!wc -l nih_all.csv #Count the number of lines in the file

  308399 nih_all.csv
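
The inverted match is the part that is easy to misread, so here is a small Python sketch of the same idea. The `drop_matching` helper and the sample rows are made up for illustration, and the sketch assumes the pattern may match anywhere in the field. It also works out how many lines the filter removed, using the two `wc -l` results above.

```python
import re


def drop_matching(rows, column, pattern):
    """Keep only the rows whose `column` value does NOT match `pattern`."""
    regex = re.compile(pattern)
    return [row for row in rows if not regex.search(row[column])]


# Hypothetical sample rows, not taken from the NIH data.
sample_rows = [
    {"APPLICATION_ID": "1", "FUNDING_MECHANISM": "Research Projects"},
    {"APPLICATION_ID": "2", "FUNDING_MECHANISM": "SBIR-STTR"},
    {"APPLICATION_ID": "3", "FUNDING_MECHANISM": "Research Centers"},
]

kept = drop_matching(sample_rows, "FUNDING_MECHANISM", "SBIR-STTR")

# Line counts reported by `wc -l` before and after the filter.
lines_before = 315065
lines_after = 308399
lines_removed = lines_before - lines_after  # 6666
```

The difference between the two counts, 6666 lines, is the number of `SBIR-STTR` lines dropped from the combined file.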


## Load the combined NIH csv into a pandas dataframe

In [32]:
import pandas
import pandasql

nih_df = pandas.read_csv("./nih_all.csv")
nih_df.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

# Count the number of entries in the data frame
print(len(nih_df.index))

nih_df.to_csv("./test.csv")

308398
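
The data frame holds 308398 rows while `wc -l` reported 308399 lines; the one-line difference is the header row. A sketch of that consistency check in Python 3 follows. `count_lines_and_rows` is a hypothetical helper, and the two numbers only differ by exactly one when no field contains an embedded line break.

```python
import csv


def count_lines_and_rows(path):
    """Return (physical line count, CSV data-row count) for a file.

    The line count mirrors `wc -l` for a file that ends with a newline;
    the row count excludes the header.
    """
    with open(path, "r", encoding="utf-8", newline="") as f:
        lines = sum(1 for _ in f)
    with open(path, "r", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        rows = sum(1 for _ in reader)
    return lines, rows
```

If `lines - rows` comes out larger than 1, some quoted field spans several lines, and line counts should not be used as row counts for that file.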
