# Nature NIH conflict of interest investigation

Working process to determine the percentage of grants given to each institution that have an FCOI (financial conflict of interest) investigation associated with them.

`43736 Reardon with redactions.xls`: NIH FOI document listing all conflict of interest reports that investigators filed in 2012 and 2013

**NIH Research Portfolio Online Reporting Tools (RePORT)** - A repository of NIH-funded research projects, with access to the publications and patents resulting from that funding.

**ExPORTER Data Catalog** [http://exporter.nih.gov/ExPORTER_Catalog.aspx](http://exporter.nih.gov/ExPORTER_Catalog.aspx) - makes downloadable versions of the data accessed through the RePORT Expenditures and Results (RePORTER) interface available to the public. 

## Download and unzip NIH RePORTER files relating to fiscal years 2011-14

In [1]:
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2014.zip
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2013.zip
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2012.zip
!curl -O http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY2011.zip
    
!unzip RePORTER_PRJ_C_FY2014.zip 
!unzip RePORTER_PRJ_C_FY2013.zip
!unzip RePORTER_PRJ_C_FY2012.zip
!unzip RePORTER_PRJ_C_FY2011.zip

!rm RePORTER_PRJ_C_FY2014.zip
!rm RePORTER_PRJ_C_FY2013.zip
!rm RePORTER_PRJ_C_FY2012.zip
!rm RePORTER_PRJ_C_FY2011.zip

We should now have four CSV files, one each for 2011, 2012, 2013 and 2014.
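
The download-and-unzip step could also be scripted in Python instead of shell escapes. The sketch below assumes Python 3 and that the URL pattern in the `curl` commands holds for every year; `exporter_url` and `fetch_and_extract` are illustrative names, not part of any NIH tooling.

```python
import io
import zipfile
from urllib.request import urlopen

# URL pattern taken from the curl commands in the cell above.
URL_TEMPLATE = "http://exporter.nih.gov/CSVs/final/RePORTER_PRJ_C_FY{year}.zip"
FISCAL_YEARS = range(2011, 2015)  # 2011 through 2014 inclusive


def exporter_url(year):
    """Build the download URL for one fiscal year (hypothetical helper)."""
    return URL_TEMPLATE.format(year=year)


def fetch_and_extract(year, dest="."):
    """Download one archive into memory and unpack it into dest.

    Returns the names of the extracted files. The zip itself is never
    written to disk, so there is nothing to delete afterwards.
    """
    with urlopen(exporter_url(year)) as response:
        payload = response.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Building the URLs needs no network access.
urls = [exporter_url(year) for year in FISCAL_YEARS]
```

Calling `fetch_and_extract(year)` for each year in `FISCAL_YEARS` would perform the actual downloads; the block as written only builds the URLs.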

## Clean and combine the NIH RePORTER documents

Extract just the relevant columns from the four NIH documents into new documents and convert them from `iso-8859-1` to `utf-8`.

In [29]:
!in2csv -e iso-8859-1 RePORTER_PRJ_C_FY2014.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > csv/nih_2014.csv
!in2csv -e iso-8859-1 RePORTER_PRJ_C_FY2013.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > csv/nih_2013.csv
!in2csv -e iso-8859-1 RePORTER_PRJ_C_FY2012.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > csv/nih_2012.csv
!in2csv -e iso-8859-1 RePORTER_PRJ_C_FY2011.csv | csvcut -c 'APPLICATION_ID, PROJECT_TITLE, PI_NAMEs, ORG_NAME, FUNDING_MECHANISM, FY, PROJECT_START, PROJECT_END, TOTAL_COST' > csv/nih_2011.csv
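
For readers without csvkit, the same column selection and re-encoding can be approximated with the standard library. This is a minimal sketch, assuming Python 3 and that every input file has a header row containing the listed column names; `extract_columns` is a hypothetical helper.

```python
import csv

# Column names as they appear in the csvcut commands above.
COLUMNS = ["APPLICATION_ID", "PROJECT_TITLE", "PI_NAMEs", "ORG_NAME",
           "FUNDING_MECHANISM", "FY", "PROJECT_START", "PROJECT_END",
           "TOTAL_COST"]


def extract_columns(src_path, dst_path, columns=COLUMNS,
                    src_encoding="iso-8859-1"):
    """Copy the chosen columns from src_path to dst_path, re-encoded as UTF-8."""
    with open(src_path, newline="", encoding=src_encoding) as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every column not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```

A call such as `extract_columns("RePORTER_PRJ_C_FY2014.csv", "csv/nih_2014.csv")` would stand in for the first shell line.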

Combine all of the entries for all four years into one CSV file:

In [30]:
!csvstack csv/nih_2014.csv csv/nih_2013.csv csv/nih_2012.csv csv/nih_2011.csv > csv/nih_all.csv
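
A rough Python equivalent of the stacking step is sketched below. It assumes Python 3, that the inputs are non-empty UTF-8 files, and that they share an identical header row; `stack_csv` is an illustrative name.

```python
import csv


def stack_csv(paths, dst_path):
    """Concatenate CSV files that share a header, writing the header once.

    Returns the number of data rows written (header excluded).
    """
    header = None
    rows_written = 0
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                file_header = next(reader)
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError("header mismatch in %s" % path)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

If no field contains an embedded newline, the returned row count plus one header line should equal the line count of the combined file.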

In [31]:
!wc -l csv/nih_all.csv #Count the number of lines in the file

  315065 csv/nih_all.csv


In [32]:
!csvcut -n csv/nih_all.csv #Get a numbered list of the columns

  1: APPLICATION_ID
  2: PROJECT_TITLE
  3: PI_NAMEs
  4: ORG_NAME
  5: FUNDING_MECHANISM
  6: FY
  7: PROJECT_START
  8: PROJECT_END
  9: TOTAL_COST


The regulation does not apply to Phase I Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) applications/awards, so remove all `SBIR-STTR` entries from the `FUNDING_MECHANISM` column.

In [33]:
!csvgrep -c 5 -r "SBIR-STTR" -i csv/nih_all.csv > csv/temp.csv && mv csv/temp.csv csv/nih_all.csv

In [34]:
!wc -l csv/nih_all.csv #Count the number of lines in the file

  308399 csv/nih_all.csv
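
The two line counts give a quick check on how many rows the filter removed. The sketch below also shows the same inverted regex filter in plain Python; `drop_matching` is a hypothetical helper, and it assumes the pattern is applied as a search within the field and that no field contains an embedded newline (otherwise `wc -l` overcounts rows).

```python
import re

# Line counts reported by `wc -l`, minus one header line each.
rows_before = 315065 - 1
rows_after = 308399 - 1
rows_removed = rows_before - rows_after  # 6666 rows matched SBIR-STTR


def drop_matching(rows, column, pattern):
    """Keep only the rows whose `column` does NOT match `pattern`.

    Mirrors csvgrep's -r (regex) combined with -i (invert match).
    """
    regex = re.compile(pattern)
    return [row for row in rows if not regex.search(row[column])]
```

Here `rows` is any list of dictionaries keyed by column name, for example the output of `csv.DictReader`.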


## Remove redundant files

In [51]:
!rm RePORTER_PRJ_C_FY2014.csv
!rm RePORTER_PRJ_C_FY2013.csv
!rm RePORTER_PRJ_C_FY2012.csv
!rm RePORTER_PRJ_C_FY2011.csv
!rm csv/nih_2014.csv
!rm csv/nih_2013.csv
!rm csv/nih_2012.csv
!rm csv/nih_2011.csv

## Import pandas, pandasql and csvkit

In [6]:
import pandas
import pandasql
import csvkit

## Load the combined NIH csv into a pandas dataframe

In [35]:
nih_df = pandas.read_csv("./csv/nih_all.csv")
nih_df.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)

nih_org_series = nih_df['org_name'].value_counts()
nih_org_df = pandas.DataFrame(nih_org_series, columns=['total-grants'])
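
How `pandas.DataFrame(series, columns=[...])` treats a Series whose name differs from the requested column may depend on the pandas version, so the counts column is not guaranteed to come out the same on a newer install. A sketch of a more explicit alternative follows, using `Series.to_frame`; `sample_df` is a made-up miniature of `nih_df`, for illustration only.

```python
import pandas

# Hypothetical stand-in for nih_df with three grants at two institutions.
sample_df = pandas.DataFrame({"org_name": ["ORG A", "ORG B", "ORG A"]})

counts = sample_df["org_name"].value_counts()

# to_frame() names the single column explicitly, whatever the Series is called.
sample_org_df = counts.to_frame(name="total-grants")
```

The resulting frame is indexed by institution name, which is what the later index-based merge relies on.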

In [36]:
# Each pi_names value ends with a semicolon.
# Strip it so the names can be merged with the FCOI names.
print("Before: ")
print(nih_df['pi_names'].head())

nih_df['pi_names'] = nih_df['pi_names'].map(lambda x: str(x)[:-1])

print("\n")
print("After: ")
print(nih_df['pi_names'].head())

Before: 
0                                     BAR-SAGI, DAFNA;
1                                     BAR-SAGI, DAFNA;
2    ELLNER, JERROLD J. (contact);JOLOBA, MOSES LUT...
3    DESHPANDE, SMITA N;MANSOUR, HADER A.;NIMGAONKA...
4    TIRSCHWELL, DAVID L;ZUNT, JOSEPH RAYMOND (cont...
Name: pi_names, dtype: object


After: 
0                                      BAR-SAGI, DAFNA
1                                      BAR-SAGI, DAFNA
2    ELLNER, JERROLD J. (contact);JOLOBA, MOSES LUT...
3    DESHPANDE, SMITA N;MANSOUR, HADER A.;NIMGAONKA...
4    TIRSCHWELL, DAVID L;ZUNT, JOSEPH RAYMOND (cont...
Name: pi_names, dtype: object
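
Slicing with `str(x)[:-1]` removes the last character whether or not it is a semicolon, and turns a missing value into the string `na`. A more defensive variant is sketched below, assuming Python 3; `strip_trailing_semicolon` is a hypothetical helper, not something the notebook uses.

```python
def strip_trailing_semicolon(value):
    """Remove trailing semicolons, leaving all other characters untouched.

    Missing values (pandas passes NaN as a float) become an empty string
    instead of being converted to text and truncated.
    """
    if not isinstance(value, str):
        return ""
    return value.rstrip(";")
```

It could be applied with `nih_df['pi_names'].map(strip_trailing_semicolon)`.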


In [37]:
# Count of grants per institution
nih_org_df.head(4)

| | total-grants |
|---|---|
| JOHNS HOPKINS UNIVERSITY | 6705 |
| UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | 6289 |
| UNIVERSITY OF PENNSYLVANIA | 5807 |
| UNIVERSITY OF MICHIGAN | 5336 |


In [38]:
nih_person_series = nih_df['pi_names'].value_counts()
nih_person_df = pandas.DataFrame(nih_person_series, columns=['total-grants'])

In [39]:
# Count of grants per person
nih_person_df.head(4)

| | total-grants |
|---|---|
| , | 1555 |
| HEIMBROOK, DAVID | 235 |
| WEINER, GEORGE J | 154 |
| CARBONE, MICHELE | 114 |


## Load the FCOI data into a pandas dataframe

In [40]:
!in2csv --sheet "SQL Results" "ref/43736 Reardon with redactions.xls" > csv/fcoi.csv

In [42]:
!csvcut -n csv/fcoi.csv #Get a numbered list of the columns

  1: FCOI_ID
  2: PROJECT NUMBER
  3: PROJECT TITLE
  4: CONTACT PD/PI NAME
  5: FCOI INVESTIGATOR NAME
  6: SUBMITTING INSTITUTION
  7: FCOI REPORT TYPE
  8: FCOI AMOUNT VALUE
  9: FCOI REPORT STATUS


In [44]:
fcoi_df = pandas.read_csv("./csv/fcoi.csv")
fcoi_df.rename(columns = lambda x: x.replace(' ','_').lower(),  inplace=True)

fcoi_org_series = fcoi_df['submitting_institution'].value_counts()
fcoi_org_df = pandas.DataFrame(fcoi_org_series, columns=['total-fcoi-claims'])

fcoi_person_series = fcoi_df['contact_pd/pi_name'].value_counts()
fcoi_person_df = pandas.DataFrame(fcoi_person_series, columns=['totas-fcoi-claims'])

In [45]:
# Count of total fcoi claims per institution
fcoi_org_df.head(4)

| | total-fcoi-claims |
|---|---|
| YALE UNIVERSITY | 123 |
| UNIVERSITY OF WISCONSIN-MADISON | 106 |
| UNIV OF NORTH CAROLINA CHAPEL HILL | 88 |
| MASSACHUSETTS GENERAL HOSPITAL | 86 |


In [46]:
# Count of total fcoi claims per person
fcoi_person_df.head(4)

| | totas-fcoi-claims |
|---|---|
| SCOTT, INGRID U | 52 |
| SHERWIN, ROBERT Stanley | 46 |
| LACHIN, JOHN M | 22 |
| Diller, Kenneth R. | 20 |


## Join the NIH and FCOI data frames for comparison

In [47]:
# pd.merge(left_frame, right_frame, on='key', how='inner')
org_merge = pandas.merge(fcoi_org_df, nih_org_df, left_index=True, right_index=True, how='inner')

org_merge.to_csv("./csv/institution_fcoi_nih.csv")

org_merge.head()

| | total-fcoi-claims | total-grants |
|---|---|---|
| JOHNS HOPKINS UNIVERSITY | 66 | 6705 |
| UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | 2 | 6289 |
| UNIVERSITY OF PENNSYLVANIA | 65 | 5807 |
| UNIVERSITY OF MICHIGAN | 39 | 5336 |
| UNIVERSITY OF WASHINGTON | 58 | 5146 |
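
The stated goal is a rate of FCOI reports relative to grants per institution, which follows from the two merged columns by simple division. The sketch below uses only the five rows shown above; `org_sample` and the column name `fcoi-per-100-grants` are made up for illustration.

```python
import pandas

# Counts copied from the first five rows of org_merge shown above.
org_sample = pandas.DataFrame(
    {
        "total-fcoi-claims": [66, 2, 65, 39, 58],
        "total-grants": [6705, 6289, 5807, 5336, 5146],
    },
    index=[
        "JOHNS HOPKINS UNIVERSITY",
        "UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",
        "UNIVERSITY OF PENNSYLVANIA",
        "UNIVERSITY OF MICHIGAN",
        "UNIVERSITY OF WASHINGTON",
    ],
)

# FCOI reports per 100 grants for each institution.
org_sample["fcoi-per-100-grants"] = (
    100.0 * org_sample["total-fcoi-claims"] / org_sample["total-grants"]
)

# Highest rate first.
ranked = org_sample.sort_values("fcoi-per-100-grants", ascending=False)
```

This ratio is only a rough proxy for the percentage of grants with an FCOI: one grant can carry several reports, and the FCOI reports cover 2012 and 2013 while the grant counts cover fiscal years 2011-14.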


In [49]:
# Full outer join as there are lots of mismatches
person_merge = pandas.merge(fcoi_person_df, nih_person_df, left_index=True, right_index=True, how='outer')

person_merge.to_csv("./csv/person_fcoi_nih.csv")

person_merge.head()

| | totas-fcoi-claims | total-grants |
|---|---|---|
| , | | 1555 |
| AAGAARD-TILLERY, KJERSTI MARIE | | 7 |
| AAKALU, VINAY | | 1 |
| AALBERSBERG, WILLIAM | | 1 |
| AALBERTS, DANIEL PAUL | | 1 |

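One plausible source of mismatches is visible in the outputs above: `pi_names` can hold several semicolon-separated names with a `(contact)` marker, while the FCOI names are single names in mixed case. The sketch below reduces a `pi_names` value to one upper-cased contact name before merging. It assumes Python 3; `contact_pi` is a hypothetical helper, and whether it improves the match rate on the real data is untested.

```python
def contact_pi(pi_names):
    """Pick the contact PI from a semicolon-separated pi_names string.

    Prefers the entry flagged "(contact)"; otherwise falls back to the
    first name. The result is upper-cased with the marker removed.
    """
    if not isinstance(pi_names, str):
        return ""  # missing values arrive as NaN (a float)
    names = [n.strip() for n in pi_names.split(";") if n.strip()]
    if not names:
        return ""
    flagged = [n for n in names if "(contact)" in n]
    chosen = flagged[0] if flagged else names[0]
    return chosen.replace("(contact)", "").strip().upper()
```

Applying the same upper-casing to the FCOI `contact_pd/pi_name` column would put both sides of the merge in a comparable form.
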

## Sum up the types of claims per institution


In [54]:
'''Cannot add up types of claims per institution because the FCOI SFI NATURE DESCRIPTION column is missing'''
# fcoi_type_of_claim_list = fcoi_df['fcoi_sfi_nature_description'].unique().tolist()

# len(fcoi_type_of_claim_list)


'Cannot add up types of claims per institution because the FCOI SFI NATURE DESCRIPTION column is missing'
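
If a type column does become available, the per-institution tally could be guarded so that a missing column is reported rather than raising an error. This is a minimal sketch with a hypothetical helper, `claims_by_type`; it takes rows as dictionaries keyed by the lower-cased, underscored column names used for `fcoi_df`.

```python
from collections import Counter, defaultdict


def claims_by_type(rows, type_column, org_column="submitting_institution"):
    """Tally rows per institution and per value of type_column.

    Returns None when type_column is absent from the data, which is the
    situation described above for fcoi_sfi_nature_description.
    """
    rows = list(rows)
    if not rows or type_column not in rows[0]:
        return None
    tally = defaultdict(Counter)
    for row in rows:
        tally[row[org_column]][row[type_column]] += 1
    return dict(tally)
```

The rows could come from `fcoi_df.to_dict("records")`. The same function would also work on a column that is present in this file, such as `fcoi_report_type`, although that column describes the report rather than the nature of the financial interest.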