# 0: Collect data 

This notebook assumes that you have downloaded the project data files from the [RePORTER](https://reporter.nih.gov/exporter/projects) website. A `data` folder must exist in the project root folder for this notebook to work.

Checklist:
- Make sure you have a `data` folder in the project root folder
- Make sure that you have downloaded the RePORTER dataset and that it is stored in that `data` folder (the notebook reads from `DATA_PATH`)
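The checklist can also be verified programmatically before running the rest of the notebook. The following is a minimal sketch, assuming the files sit in the folder that `DATA_PATH` points to; the helper name `find_reporter_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_reporter_files(data_dir, prefix="RePORTER"):
    """Return sorted CSV paths in data_dir whose names start with prefix.

    Hypothetical helper: raises FileNotFoundError when the folder is
    missing or when it contains no matching files.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Missing data folder: {data_dir}")
    files = sorted(data_dir.glob(f"{prefix}*.csv"))
    if not files:
        raise FileNotFoundError(f"No {prefix}*.csv files found in {data_dir}")
    return files
```

Calling `find_reporter_files("../data")` from the notebook folder would then fail early with a readable message instead of producing an empty file list later on.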

This notebook merges all downloaded datasets and writes a single compressed file.

In [1]:
from pathlib import Path
import pandas as pd

## Parameters
Here are the parameters used to run this notebook:
- `PREFIX`: {str} -> filename prefix shared by all RePORTER files; it is used to build the glob pattern that finds them
- `DATA_PATH`: {str} -> path to the folder where all the RePORTER files are stored
- `OUTDIR`: {str} -> path to the results directory
- `OUTNAME`: {str} -> name of the generated merged file (without extension)

In [10]:
PREFIX = "RePORTER"
DATA_PATH = "../data"
OUTDIR = "../results"
OUTNAME = "projects_2019-2022"

## Getting data file paths

In [11]:
# create the output directory if it does not already exist
out_dir_path = Path(OUTDIR)
out_dir_path.mkdir(exist_ok=True)

# collect the absolute paths of all RePORTER csv files
data_path = Path(DATA_PATH)
all_files = [str(_path.absolute()) for _path in data_path.glob(f"{PREFIX}*.csv")]
all_files

['/Users/erikserrano/Development/Projects/NIH-Faculty-Search/notebooks/../data/RePORTER_PRJ_C_FY2019_new.csv',
 '/Users/erikserrano/Development/Projects/NIH-Faculty-Search/notebooks/../data/RePORTER_PRJ_C_FY2021.csv',
 '/Users/erikserrano/Development/Projects/NIH-Faculty-Search/notebooks/../data/RePORTER_PRJ_C_FY2020.csv']
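To confirm which fiscal years were actually picked up, the year can be read from each file name. This is a small sketch, assuming every file name contains `FY` followed by a four-digit year, as in the listing above; the `fiscal_year` helper is hypothetical.

```python
import re
from pathlib import Path

# File names copied from the listing above.
names = [
    "RePORTER_PRJ_C_FY2019_new.csv",
    "RePORTER_PRJ_C_FY2021.csv",
    "RePORTER_PRJ_C_FY2020.csv",
]


def fiscal_year(file_name):
    """Return the four-digit year following 'FY', or None if absent."""
    match = re.search(r"FY(\d{4})", Path(file_name).name)
    return int(match.group(1)) if match else None


years = sorted(fiscal_year(name) for name in names)
print(years)  # [2019, 2020, 2021]
```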

## Loading data files into a pandas dataframe

Some lines in these csv files are formatted in a way that causes the pandas tokenizer to fail:
- `encoding_errors="ignore"` drops any bytes that are not valid utf-8 while the file is decoded
- `on_bad_lines="skip"` skips malformed lines so that the C tokenizer does not fail
- `project_df` first collects one dataframe per file; these dataframes are then concatenated into a single one that holds the project information from all files
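The effect of the two options can be seen on a tiny in-memory example. This is only a sketch: the bytes below are made up for illustration and are not taken from the RePORTER files, and both `encoding_errors` and `on_bad_lines` require pandas 1.3 or newer.

```python
import io

import pandas as pd

# Toy CSV (made up): the row with ID 2 contains an invalid UTF-8 byte (0xff),
# and the row with ID 3 has one field more than the header declares.
raw = b"ID,NAME\n1,ok\n2,caf\xff\n3,too,many\n4,fine\n"

toy = pd.read_csv(
    io.BytesIO(raw),
    encoding="utf-8",
    encoding_errors="ignore",  # drop bytes that cannot be decoded as utf-8
    on_bad_lines="skip",       # skip rows with too many fields
)
print(toy)
```

The invalid byte is dropped silently and the over-long row disappears from the result, which is why the loaded row count can be lower than the number of lines in the source files.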

In [12]:
# Some lines in these csv files cause the pandas tokenizer to fail
# -- encoding_errors="ignore" drops any bytes that are not valid utf-8
# -- on_bad_lines="skip" skips malformed lines so the C tokenizer does not fail

# -- project_df starts as a list holding one dataframe per file
# -- the list is then concatenated into a single dataframe
project_df = []
for _file in all_files:
    df = pd.read_csv(
        _file,
        encoding="utf-8",
        encoding_errors="ignore",
        on_bad_lines="skip",
    )
    project_df.append(df)

project_df = pd.concat(project_df)

# printing out df metadata
rows, columns = project_df.shape
print(f"MESSAGE: Dataframe loaded {rows} rows and {columns} columns")
print("WARNING: Some entries may be omitted due to content not being utf-8 compatible")
project_df.head()

MESSAGE: Dataframe loaded 244429 rows and 46 columns


| | APPLICATION_ID | ACTIVITY | ADMINISTERING_IC | APPLICATION_TYPE | ARRA_FUNDED | AWARD_NOTICE_DATE | BUDGET_START | BUDGET_END | CFDA_CODE | CORE_PROJECT_NUM | ... | SERIAL_NUMBER | STUDY_SECTION | STUDY_SECTION_NAME | SUBPROJECT_ID | SUFFIX | SUPPORT_YEAR | DIRECT_COST_AMT | INDIRECT_COST_AMT | TOTAL_COST | TOTAL_COST_SUB_PROJECT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9787485 | UG3 | DK | 5.0 | N | 08/23/2019 | 08/01/2019 | 07/31/2020 | 847.0 | UG3DK120004 | ... | 120004.0 | ZDK1 | Special Emphasis Panel | | | 2.0 | 1305190.0 | 209958.0 | 1170825.0 | |
| 1 | 9999888 | R01 | GM | 7.0 | N | 01/27/2020 | 08/01/2019 | 04/30/2020 | 859.0 | R01GM072562 | ... | 72562.0 | ICI | Intercellular Interactions Study Section | | | 13.0 | 164377.0 | 84654.0 | 249031.0 | |
| 2 | 10002129 | U42 | OD | 3.0 | N | 09/05/2019 | 09/09/2019 | 09/08/2020 | 310.0 | U42OD026645 | ... | 26645.0 | ZRG1 | Special Emphasis Panel | | S1 | 2.0 | 249195.0 | 145779.0 | 394974.0 | |
| 3 | 9698861 | R01 | AA | 5.0 | N | 05/27/2019 | 06/01/2019 | 05/31/2021 | 273.0 | R01AA023722 | ... | 23722.0 | HBPP | Hepatobiliary Pathophysiology Study Section | | | 5.0 | 276377.0 | 161681.0 | 438058.0 | |
| 4 | 9658987 | R01 | GM | 2.0 | N | 02/18/2019 | 03/01/2019 | 02/29/2020 | 859.0 | R01GM106373 | ... | 106373.0 | GHD | Genetics of Health and Disease Study Section | | | 5.0 | 255000.0 | 110664.0 | 365664.0 | |


In [13]:
# These are the given features along with metadata
project_df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244429 entries, 0 to 82358
Data columns (total 46 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   APPLICATION_ID          244429 non-null  int64  
 1   ACTIVITY                244429 non-null  object 
 2   ADMINISTERING_IC        244429 non-null  object 
 3   APPLICATION_TYPE        240173 non-null  float64
 4   ARRA_FUNDED             244429 non-null  object 
 5   AWARD_NOTICE_DATE       230704 non-null  object 
 6   BUDGET_START            230760 non-null  object 
 7   BUDGET_END              230734 non-null  object 
 8   CFDA_CODE               194806 non-null  float64
 9   CORE_PROJECT_NUM        240538 non-null  object 
 10  ED_INST_TYPE            145283 non-null  object 
 11  FOA_NUMBER              230552 non-null  object 
 12  FULL_PROJECT_NUM        244429 non-null  object 
 13  FUNDING_ICs             237849 non-null  object 
 14  FUNDING_MECHANISM    
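The summary above reports 244429 entries while the index labels only run from 0 to 82358, which suggests that each file kept its own row labels during concatenation and that labels therefore repeat. If unique row labels are needed downstream, `pd.concat` accepts `ignore_index=True`. The sketch below uses two made-up toy frames rather than the real data, and is a possible variation, not the notebook's actual code.

```python
import pandas as pd

# Two toy frames standing in for per-year RePORTER files (values are made up).
fy_a = pd.DataFrame({"APPLICATION_ID": [1, 2]})
fy_b = pd.DataFrame({"APPLICATION_ID": [3, 4, 5]})

kept = pd.concat([fy_a, fy_b])                      # index labels: 0, 1, 0, 1, 2
fresh = pd.concat([fy_a, fy_b], ignore_index=True)  # index labels: 0, 1, 2, 3, 4

print(kept.index.is_unique)   # False
print(fresh.index.is_unique)  # True
```

Because the final file is written with `index=False`, the repeated labels do not reach the saved output; they only matter for label-based operations such as `.loc` inside the notebook.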

## Save the combined file as a gzip-compressed csv

In [9]:
save_path = out_dir_path / f"{OUTNAME}.csv.gz"
project_df.to_csv(save_path, index=False, compression="gzip")
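A quick way to check the saved file is to read it back. The sketch below does a round trip on a made-up toy frame in a temporary directory, so it does not touch the real results folder; the names `toy_df` and `toy_path` are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for project_df (illustrative values only).
toy_df = pd.DataFrame(
    {"APPLICATION_ID": [9787485, 9999888], "ACTIVITY": ["UG3", "R01"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_projects.csv.gz"
    toy_df.to_csv(toy_path, index=False, compression="gzip")
    # read_csv infers gzip compression from the ".gz" suffix by default
    reloaded = pd.read_csv(toy_path)

print(reloaded.shape)
```

The same `pd.read_csv(save_path)` call should work on the merged file, since compression is inferred from the file extension.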